# Dataset Description

 **Dataset: Laptop Specs and Latest Price By Santosh Kumar (2022)**

**Brief Overview**

**The dataset illustrates the factors that influence the prices of laptops**

* Several different factors can affect laptop computer prices. These factors include the brand of computer and the number of options and add-ons included in the computer package. In addition, the amount of memory and the speed of the processor can also affect pricing. Though less common, some consumers spend additional money to purchase a computer based on the overall “look” and design of the system.

* In many cases, name brand computers are more expensive than generic versions. This price increase often has more to do with name recognition than any actual superiority of the product. One major difference between name brand and generic systems is that in most cases, name brand computers offer better warranties than generic versions. Having the option of returning a computer that is malfunctioning is often enough of an incentive to encourage many consumers to spend more money.

* Functionality is an important factor in determining laptop computer prices. A computer with more memory often performs better for a longer time than a computer with less memory. In addition, hard drive space is also crucial, and the size of the hard drive usually affects pricing. Many consumers may also look for digital video drivers and other types of recording devices that may affect the laptop computer prices.

* Most computers come with some software pre-installed. In most cases, the more software that is installed on a computer, the more expensive it is. This is especially true if the installed programs are from well-established and recognizable software publishers. Those considering purchasing a new laptop computer should be aware that many of the pre-installed programs may be trial versions only, and will expire within a certain time period. In order to keep the programs, a code will need to be purchased, and then a permanent version of the software can be downloaded.

* Many consumers who are purchasing a new computer are buying an entire package. In addition to the computer itself, these systems typically include a monitor, keyboard, and mouse. Some packages may even include a printer or digital camera. The number of extras included in a computer package usually affects laptop computer prices.

* Some industry leaders in computer manufacturing make it a selling point to offer computers in sleek styling and in a variety of colors. They may also offer unusual or contemporary system design. Though this is less important to many consumers, for those who do value “looks,” this type of system may be well worth the extra cost.

**Collection Process**

The data was gathered from flipkart.com using Instant Data Scraper, an automated Chrome extension. Instant Data Scraper extracts data from web pages and exports it as Excel or CSV files. It is intended to work on any website and uses AI to predict which data on an HTML page is most relevant.

The tool does not require scripts. It uses heuristic AI analysis of the HTML structure to detect data for extraction. If the prediction is not satisfactory, the user can customize the selections for greater accuracy. It works on small websites as well as large, well-known ones such as Amazon.

Since the data relies solely on listings from flipkart.com, it is not fully representative of laptops sold through other channels.
Because of this, the prices at which these laptops are sold may be higher or lower depending on the seller. Moreover, what is included in the purchase, such as the warranty, installed software, and other peripherals, may differ from what another channel offers. Flipkart is also an Indian e-commerce website, so the prices gathered reflect the Indian market and are denominated in Indian rupees. Therefore, the insights and conclusions generated may not be fully applicable to every laptop sold around the world.

**Structure of the Dataset**

The dataset presents different laptops and their characteristics. Each row corresponds to a single laptop listed by a seller on flipkart.com. There are 896 rows in the dataset, which means it covers 896 laptop listings. There are 23 columns, each representing a specific attribute or variable of a laptop.

**Description of each variable**
* **brand**- refers to how the laptop is publicly distinguished from those manufactured by other companies
* **model**- the particular version or design of a laptop from a given manufacturer
* **processor_brand**- refers to how the laptop's processor is publicly distinguished from those manufactured by other companies
* **processor_name**- refers to the specific processor of the laptop
* **processor_gnrtn**- refers to the generation number of the laptop's processor
* **ram_gb**- refers to the total RAM of the laptop in GB
* **ram_type**- refers to the type of the laptop's RAM
* **ssd**- refers to the capacity of the laptop's SSD
* **hdd**- refers to the capacity of the laptop's HDD
* **os**- refers to the operating system of the laptop
* **os_bit**- determines whether the OS is 32-bit or 64-bit
* **graphic_card_gb**- refers to the video memory of the laptop in GB
* **weight**- refers to the weight classification of the laptop
* **display_size**- refers to the diagonal length of the laptop's screen in inches
* **warranty**- refers to the warranty in years
* **touchscreen**- indicates whether or not the laptop has touchscreen capabilities
* **msoffice**- indicates whether the laptop comes with Microsoft Office preinstalled
* **latest_price**- latest price in Indian Rupees
* **old_price**- original price in Indian Rupees
* **discount**- the discount in % applied to the old_price
* **star_rating**- refers to the average star rating of the product, where 0 is the lowest and 5 is the highest
* **ratings**- refers to the total number of ratings given by buyers
* **reviews**- refers to the total number of reviews written by buyers
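
The relationship between old_price, latest_price, and discount described in the list above can be sketched as follows. This is a minimal illustration with made-up prices. The helper name `discount_percent` is hypothetical, and how the dataset itself rounds the discount column is an assumption left open here.

```python
def discount_percent(old_price: float, latest_price: float) -> float:
    """Return the discount as a percentage of old_price (hypothetical helper)."""
    if old_price <= 0:
        # Assumption: with no valid original price, report no discount.
        return 0.0
    return (old_price - latest_price) * 100 / old_price


# Made-up example: a laptop originally listed at 50,000 rupees, now sold at 40,000 rupees
print(discount_percent(50000, 40000))  # 20.0
```

A listing with an old_price of 0 is treated here as having no discount, which mirrors how such rows are handled in the cleaning step.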

## Data Cleaning

Among all the variables, only "processor_name" showed major inconsistencies in its data representation. Some values were ambiguous or too broad, such as "Ryzen" and "Core", which cover too many CPUs to identify a specific processor. Other values, such as "GeForce RTX" and "Genuine Windows", are not processor names at all. These observations are dropped, which reduces the dataset from 896 to 881 observations.

A lesser issue is the formatting of "ram_gb". Each value is followed by a duplicated unit, for example "4 GB GB" and "8 GB GB". This does not affect the analysis, since the whole column is formatted the same way, but it is still corrected in this notebook to make the data cleaner.

Lastly, some values of "old_price" are set to 0. This occurs whenever no discount is available, which means the laptop's price has always been the value of "latest_price". To avoid confusion, these values are set equal to "latest_price" in the notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
laptop_df = pd.read_csv("Laptop_data.csv")
laptop_df.head(10)

In [None]:
laptop_df.info()

In [None]:
laptop_df['processor_name'].value_counts()

In [None]:
laptop_df.shape

In [None]:
# Processor names that are too broad or are not processor names at all
invalid_names = ["Ryzen", "Core", "GEFORCE RTX", "GeForce GTX", "GeForce RTX", "Quad", "Ever Screenpad", "Genuine Windows"]
# Keep every other row; .copy() avoids SettingWithCopyWarning in later assignments
clean_df = laptop_df[~laptop_df["processor_name"].isin(invalid_names)].copy()
clean_df.shape

In [None]:
clean_df['processor_name'].value_counts()

In [None]:
clean_df['ram_gb'].unique()

In [None]:
clean_df['ram_gb'] = clean_df['ram_gb'].replace({'4 GB GB':'4 GB', '8 GB GB':'8 GB', '32 GB GB':'32 GB', '16 GB GB':'16 GB'})
clean_df['ram_gb'].unique()
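
The dictionary above lists each duplicated value explicitly. As an alternative, a pattern-based cleanup could handle any capacity without enumerating it. The following is a minimal sketch on a toy Series, not the real column; the sample values and the variable names `ram` and `ram_clean` are assumptions for illustration.

```python
import pandas as pd

# Toy values mimicking the duplicated unit in ram_gb (assumed sample, not the real data)
ram = pd.Series(["4 GB GB", "8 GB GB", "16 GB GB", "32 GB GB"])

# Strip surrounding whitespace, then collapse a trailing "GB GB" into a single "GB"
ram_clean = ram.str.strip().str.replace(r"\s*GB\s+GB$", " GB", regex=True)
print(ram_clean.tolist())  # ['4 GB', '8 GB', '16 GB', '32 GB']
```

The explicit dictionary is easier to audit when only a handful of values exist, while the pattern keeps working if new capacities appear.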

In [None]:
clean_df['old_price'].value_counts()

In [None]:
clean_df.loc[clean_df['old_price'] == 0,'old_price'] = clean_df.latest_price
clean_df.loc[clean_df['old_price'] == clean_df.latest_price,'old_price'].shape

In [None]:
# index=False keeps the row index from being written as an extra unnamed column
clean_df.to_csv('laptop_cleaned.csv', index=False)

## Exploratory Data Analysis

**1) What is the relationship between the star rating of laptops and their price?**

Let us get all of the ratings in the dataset.

In [None]:
clean_df["star_rating"].value_counts()

We see that there are laptops which have a rating of 0.0.

Next, we filter out the laptops with a rating of 0.0 so that they do not distort the mean.

In [None]:
rating_df = clean_df[(clean_df.star_rating != 0.0)]

rating_df["star_rating"].value_counts()

We can see that there are no longer ratings of 0.0 in the dataset.

Next, we will visualize the dataset by grouping the dataset based on the ratings. Afterwards, we get the mean of the latest_price for each rating.

In [None]:
graphrating_df = rating_df.groupby("star_rating").agg({"latest_price": ["mean"]})

In [None]:
# Collapse the single-column result into a Series so it can be plotted directly
graphrating_df = graphrating_df.squeeze()
graphrating_df.plot.barh(figsize=(12,14)).invert_yaxis()
plt.ylabel('Rating')
plt.xlabel('Price (Rupees)')
plt.title('Laptop Price For Each Rating')

The horizontal bar chart above shows the mean price of laptops for each star rating. The most expensive laptops come with ratings of 4.8, 5.0, and 1.7.
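
Because each bar is a group mean, a rating shared by only a few laptops can produce an extreme value. The 1.7 bar may be such a case, although the group sizes are not shown in the chart. A minimal sketch of how the mean could be reported together with the group size, using made-up data rather than the real dataset (the frame `toy` and its values are assumptions):

```python
import pandas as pd

# Made-up data: a single expensive laptop with a low rating dominates its group's mean
toy = pd.DataFrame({
    "star_rating": [1.7, 4.2, 4.2, 4.2, 4.8, 4.8],
    "latest_price": [150000, 40000, 50000, 60000, 120000, 130000],
})

# Report the mean price next to the number of laptops behind it
summary = toy.groupby("star_rating")["latest_price"].agg(["mean", "count"])
print(summary)
```

Reading the count column alongside the mean makes it easier to judge which bars rest on enough listings to be meaningful.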

In [None]:
corr_df = rating_df[['star_rating','latest_price']]
corr_df.corr()

The correlation between star_rating and latest_price is 0.301293. This indicates a weak to moderate positive relationship: laptops with a higher star_rating tend to have a higher latest_price, although the association is far from exact.
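
By default, the pandas `corr()` method computes the Pearson correlation coefficient. To make the number above easier to interpret, here is a minimal sketch of that coefficient computed by hand. The helper name `pearson` and the sample values are made up for illustration and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up ratings and prices, only to show the calculation
ratings = [3.5, 4.0, 4.2, 4.5, 4.8]
prices = [30000, 45000, 40000, 70000, 90000]
r = pearson(ratings, prices)
print(round(r, 3))
```

The coefficient ranges from -1 to 1. Values near 0 indicate little linear association, so a value around 0.3 points to a positive but loose relationship.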

Let us visualize the correlation through a scatter plot.

In [None]:
corr_df.plot.scatter(x="star_rating", y="latest_price", alpha=0.5)

We can see that, in general, prices tend to be higher as ratings go higher.

**2) What is the average rating of laptops for the most used processor brand vs the second most used processor brand?**

Let us first get the processor brands in the dataset.

In [None]:
clean_df['processor_brand'].value_counts()

Afterwards, we need to remove laptops that do not use an AMD or Intel processor.

In [None]:
pbrand_df = clean_df[(clean_df.processor_brand == "Intel") | (clean_df.processor_brand == "AMD")]
pbrand_df

Afterwards, we check all the star_ratings in the dataset.

In [None]:
pbrand_df["star_rating"].value_counts()

We can see that there are laptops that have a rating of 0.0. 

We will remove them so that they do not distort the mean.

In [None]:
pbrand_df = pbrand_df[(pbrand_df.star_rating != 0.0)]
pbrand_df["star_rating"].value_counts()

Now that we have successfully removed the rating of 0.0, we can now visualize the dataset.

Since we are comparing Intel vs AMD, we will group the dataset by processor_brand. Then, we get the mean star_rating for each.

In [None]:
pmean_df = pbrand_df.groupby("processor_brand").agg({"star_rating": ["mean"]})
# Collapse the single-column result into a Series so it can be plotted directly
pmean_df = pmean_df.squeeze()
pmean_df.plot.barh(figsize=(12,5)).invert_yaxis()
plt.ylabel('Processor Brand')
plt.xlabel('Star Rating')
plt.title('Laptop Rating For Intel vs AMD')

We can see that the average star rating of AMD laptops is a little higher than the average star rating of Intel laptops.

**3) What is the average rating of laptops for each processor of Intel and AMD?**

Let us get all the processor brands in the dataset.

In [None]:
clean_df["processor_brand"].value_counts()

Next, we remove the processor brands that are not Intel or AMD.

In [None]:
pname_df = clean_df[(clean_df.processor_brand == "Intel") | (clean_df.processor_brand == "AMD")]
pname_df["processor_brand"].value_counts()

Now that we have removed them, let us get all the values of star_rating in the dataset.

In [None]:
pname_df["star_rating"].value_counts()

We can see that there are ratings of 0.0 in the dataset. Let us remove them to prevent them from affecting the mean later on.

In [None]:
rating_df = pname_df[(pname_df.star_rating != 0.0)]
rating_df["star_rating"].value_counts()

Now that we have removed it, we can now visualize the dataset.

Since we are to compare the mean star_rating of each type of processor of Intel and AMD, we will group the dataset by processor name. Then, we get the mean star rating of each processor.

In [None]:
mean_df = rating_df.groupby("processor_name").agg({"star_rating": ["mean"]})
# Collapse the single-column result into a Series so it can be plotted directly
mean_df = mean_df.squeeze()
mean_df.plot.barh(figsize=(15,10)).invert_yaxis()
plt.ylabel('Processor Name')
plt.xlabel('Star Rating')
plt.title('Laptop Rating For Intel and AMD Processors')

From the graph, we can see the average star_rating of each processor. We can see that the processors with the highest rating are Core i9, Ryzen 9, and Hexa Core. On the other hand, the processors with the lowest star rating are Core m3, Celeron Dual, and A6-9225 Processor.

## Research Question

After performing all the exploratory data analysis, we were intrigued by what could possibly cause the differences in the star ratings of the laptops. 

This led us to the following research question: What are the most influential factors for the star rating of a laptop?

From the EDAs performed above, we saw possible factors that could influence the star rating of a laptop. 

The first EDA showed a positive correlation between the price of a laptop and its rating. The second EDA showed that the average star ratings of AMD and Intel laptops are not far apart. The third EDA showed that star ratings vary with the processor. However, other factors could still affect the star rating. In the first EDA, for instance, a higher price would not usually be favorable to a consumer, yet ratings tend to rise as prices go up. This suggests that multiple factors contribute to higher ratings. Based on the second and third EDAs, processor brand and specific processors may be among these factors. Other variables, such as ram_gb, weight, graphic_card_gb, and display_size, may also play a role.

Finding the most influential factors for a laptop's rating is important because it lets manufacturers see what improvements they could make or what they could focus on when designing a laptop.