# Laptops Data Prep for *Top Tech*

## About
*Top Tech* laptop data, including the kinds of laptops and the companies that sell them, provided by the IT manager. The data was entered by hand, so the file needs to be cleaned and prepared before it can be run through machine learning algorithms.


In this analysis/cleaning we:
- Address null values
- Discover and drop duplicates
- Determine average amount of primary storage
- Standardize Company names
- Find outliers
- and more!

In [None]:
import pandas as pd
df = pd.read_csv('laptops.csv')

In [None]:
df.info()

Previewing the data set with every row dropped that contains a null value in at least one column (the result is displayed, not saved back to df):

In [None]:
df.dropna()
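
As a side note on the cell above, `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch of that behavior, assuming a made-up three-row frame (the `toy` and `cleaned` names and all values are hypothetical, not taken from laptops.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from laptops.csv.
toy = pd.DataFrame({
    "Company": ["Acer", None, "Dell"],
    "Ram": [8, 16, np.nan],
})

# dropna() returns a new frame; `toy` itself is not modified.
cleaned = toy.dropna()

print(len(toy))      # row count of the original frame
print(len(cleaned))  # row count after removing incomplete rows
```

To keep the result, it would have to be assigned, for example `df = df.dropna()`.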

In [None]:
#Reimport data:
import pandas as pd
df = pd.read_csv('laptops.csv')

In [None]:
df.info()

The memory_type column has the most null values: 932.
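
The null counts can also be computed directly rather than read off the `df.info()` output. The sketch below shows one way to do that, assuming a small made-up frame (the `toy` data is hypothetical; only the column names mirror the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the data set, values are made up.
toy = pd.DataFrame({
    "memory_type": [np.nan, "DDR4", np.nan, np.nan],
    "Ram": [8, 16, np.nan, 4],
    "Price": [500.0, 900.0, 700.0, 650.0],
})

null_counts = toy.isna().sum()       # number of nulls per column
null_share = toy.isna().mean()       # fraction of rows that are null per column
worst_column = null_counts.idxmax()  # label of the column with the most nulls

print(null_counts.sort_values(ascending=False))
print(worst_column, null_share[worst_column])
```

The per-column share is useful when deciding whether a column has more null values than populated ones.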

Descriptive statistics on all numerical columns in the data set:

In [None]:
df.describe()

The summary above shows a minimum primary_storage of -2048 and a minimum Price of -1.42790. Neither storage nor price can be negative, so these are treated as data-entry errors.

To fix this, we convert the negative values in both columns to positive values by taking the absolute value.

In [None]:
df[['primary_storage', 'Price']] = df[['primary_storage', 'Price']].abs()
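
A quick way to confirm that the sign correction did what was intended is to count negative entries before and after. This is only a sketch on made-up values (the `toy` frame is hypothetical), and it rests on the same assumption as the cell above: that a negative sign is a typo and the magnitude is correct.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "primary_storage": [256, -2048, 512],
    "Price": [30000.0, -45000.0, 60000.0],
})
cols = ["primary_storage", "Price"]

negatives_before = (toy[cols] < 0).sum()  # negative entries per column
toy[cols] = toy[cols].abs()               # flip the sign of negative entries
negatives_after = (toy[cols] < 0).sum()

print(negatives_before)
print(negatives_after)
```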

In [None]:
df.describe()

With that correction, we can now determine the average amount of primary storage in a laptop within the data set: 447.58.

In [None]:
df['primary_storage'].mean().round(2)

We are also able to determine the average price of a laptop: $61,014 (non-USD).

In [None]:
df['Price'].mean().round(2)

Next, we impute the missing values in the resolution_width column by dividing total_pixels by resolution_height, since total pixels equal width times height. The average resolution width of a laptop is 1071.42.

In [None]:
df['resolution_width'] = df['resolution_width'].fillna(df['total_pixels'] / df['resolution_height'])
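
Because `fillna` accepts a Series, only the missing widths are replaced and existing values are left alone. A minimal sketch of that behavior, assuming a made-up frame (the `toy` data and the `derived_width` name are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the second row is missing its width.
toy = pd.DataFrame({
    "resolution_width": [1920.0, np.nan, 1366.0],
    "resolution_height": [1080, 1080, 768],
    "total_pixels": [2073600, 2073600, 1049088],
})

# width = total pixels / height, computed for every row
derived_width = toy["total_pixels"] / toy["resolution_height"]

# fillna aligns on the index and fills only the missing entries
toy["resolution_width"] = toy["resolution_width"].fillna(derived_width)

print(toy["resolution_width"])
```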

In [None]:
df.info()

In [None]:
df.describe()

As previously noted, the memory_type column has 932 null values. Because this column has more null values than populated ones and is generally unimportant to this specific analysis, we remove it.

In [None]:
df.drop(columns=['memory_type'], inplace=True)

In [None]:
df.info()

Now that the memory_type column won't affect our next steps, we try dropping all rows with null values and see that 717 rows remain. The result is only displayed here, not saved back to df.

In [None]:
df.dropna()

The Company column still contains standardization errors: some company names are capitalized in some rows and not in others.

In [None]:
df['Company'].value_counts()

In [None]:
df['Company'] = df['Company'].str.title()
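
The effect of title-casing can be seen on a small example. This is a sketch on made-up names (the `raw` values are hypothetical and may not occur in the data set). Note that `str.title()` also changes acronyms, so a name like "HP" becomes "Hp": consistent, but possibly not the official spelling.

```python
import pandas as pd

# Hypothetical company names with inconsistent capitalization.
raw = pd.Series(["dell", "DELL", "Dell", "hp", "HP"])

# Title-case every name so that variants collapse into one spelling.
standardized = raw.str.title()

print(raw.nunique())                # distinct spellings before
print(standardized.value_counts())  # counts per name after
```

If official spellings matter, a follow-up mapping such as `.replace({"Hp": "HP"})` could be applied; that step is an addition here, not part of the original analysis.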

In [None]:
df['Company'].value_counts()

Some laptops have extremely high prices. Using the z-score method with a threshold of 3 to create a filter, we'll determine which laptops are outliers.

In [None]:
AvgPrice = df['Price'].mean()
StdDevPrice = df['Price'].std()

In [None]:
df['Z-Score'] = (df['Price'] - AvgPrice) / StdDevPrice
df.head()

In [None]:
df.loc[df['Z-Score'] > 3]
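
The z-score of a price is its distance from the mean measured in standard deviations, z = (price - mean) / std. A minimal sketch of the filter on made-up prices (the `toy` frame and its values are hypothetical), which also shows a two-sided variant that was not used in the analysis above:

```python
import pandas as pd

# Hypothetical prices: 19 typical laptops and one extreme value.
toy = pd.DataFrame({"Price": [1000.0] * 19 + [50000.0]})

mean_price = toy["Price"].mean()
std_price = toy["Price"].std()  # sample standard deviation (ddof=1), the pandas default

toy["z_score"] = (toy["Price"] - mean_price) / std_price

# One-sided filter, as in the cell above: only unusually high prices.
high_outliers = toy.loc[toy["z_score"] > 3]

# Two-sided variant: also flags unusually low prices.
all_outliers = toy.loc[toy["z_score"].abs() > 3]

print(high_outliers)
```

The one-sided filter matches the stated goal of finding extremely high prices. On very small samples no point can reach a z-score of 3, so the method is only meaningful with enough rows.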

Using the above outlier detection method, 9 laptops in this data set are outliers according to Price.
Because there are so few, we only note them here. If a correction is needed later, there is enough data to estimate a typical price for these laptops.

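Dropping duplicates, where rows count as duplicates if they match on Company, cpu_name, gpu_name, total_pixels, and Ram: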

In [None]:
df.drop_duplicates(subset=['Company', 'cpu_name', 'gpu_name', 'total_pixels', 'Ram'], inplace=True)
df.shape
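
With `subset`, rows are compared only on the listed columns, and by default the first occurrence is kept. The sketch below illustrates this on made-up rows (the `toy` frame is hypothetical and uses fewer columns than the real subset). Two rows that differ only in Price are still treated as duplicates.

```python
import pandas as pd

# Hypothetical toy data: the first two rows match on every subset column
# but have different prices.
toy = pd.DataFrame({
    "Company": ["Dell", "Dell", "Acer"],
    "cpu_name": ["i5", "i5", "i5"],
    "Ram": [8, 8, 8],
    "Price": [500, 550, 450],
})

# Compare rows only on the subset columns; keep the first of each group.
deduped = toy.drop_duplicates(subset=["Company", "cpu_name", "Ram"])

print(toy.shape)
print(deduped.shape)
```

Comparing the shape before and after shows how many rows were removed.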