## Introduction to Data Cleaning

Run the first cell and import the laptops dataset.

You can find more information about the laptops dataset at the [Kaggle data source](https://www.kaggle.com/ionaskel/laptop-prices).

We've also used the .info() method to print the column names, the number of non-null values, and the type of each column.

In [2]:
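import pandas as pd
# index_col=0 uses the first column of the CSV file as the row index
laptops = pd.read_csv("./laptops-raw.csv", encoding="Latin-1", index_col=0)
# .info() prints its summary and returns None, which print() also displays
print(laptops.info())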

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1303 entries, 1 to 1320
Data columns (total 12 columns):
Company              1303 non-null object
Product              1303 non-null object
Type-Name            1303 non-null object
Inches               1303 non-null float64
Screen-Resolution    1303 non-null object
Cpu                  1303 non-null object
Ram                  1303 non-null object
Memory               1303 non-null object
Gpu                  1303 non-null object
OpSys                1303 non-null object
Weight               1303 non-null object
Price(euros)         1303 non-null float64
dtypes: float64(2), object(10)
memory usage: 132.3+ KB
None


Use ```head()``` and inspect the first few rows of the dataframe. Think about the following questions when first looking at your data.

- Do we have any missing values?
- What are some things you think might need cleaning?
- Are column headers all uniform?
- Are there columns with multiple similar entries?
- Are there columns that contain numeric and text information?
- How could we transform categorical features into numeric?
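
As a starting point for these questions, here is a minimal sketch of a few common inspection calls. It runs on a small made-up frame (`sample` is hypothetical and only mimics the shape of some laptops columns), not on the real dataset.

```python
import pandas as pd

# Hypothetical sample that mimics a few of the laptops columns
sample = pd.DataFrame({
    "Company": ["Apple", "HP", "Acer"],
    "Ram": ["8GB", "8GB", "4GB"],
    "OpSys": ["macOS", "No OS", "Windows 10"],
    "Weight": ["1.37kg", "1.86kg", "2.1kg"],
})

print(sample.head())                        # first rows of the frame
missing_per_column = sample.isnull().sum()  # count of missing values per column
print(missing_per_column)
print(sample.dtypes)                        # text dtypes hint at columns that mix numbers and units
```

Columns such as `Ram` and `Weight` look numeric but are stored as text, which is the kind of thing the questions above are meant to surface.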

In [22]:
#Your Code Here

#### Formatting column headers

Column headers are typically made lower case and use underscores to separate words. This convention makes it easy to access columns when slicing.

1. Iterate through the columns. You can access columns of a dataframe with the ```.columns``` accessor.
2. Use the ```.lower()``` and ```.strip()``` methods on each element of columns to make all column names lowercase and remove extra spaces.
3. Assign the new list to the original dataframe again by accessing it with columns. Hint: ```laptops.columns = #your solution here```

Bonus: Do steps 1-3 in a one-line command. Hint: use a list comprehension.
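
One possible shape for steps 1-3, sketched on a hypothetical frame with deliberately messy headers (the header names below are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with messy headers (mixed case, stray spaces)
sample = pd.DataFrame(columns=[" Company", "Type-Name ", "Price(euros)"])

# Steps 1-3 in one line: strip whitespace, then lowercase each header
sample.columns = [col.strip().lower() for col in sample.columns]
print(list(sample.columns))
```

Note that ```.strip()``` only removes leading and trailing whitespace; hyphens and parentheses are left untouched.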

In [23]:
#Your Code Here



Create a cleaning function to further clean the column names.

The function should take in a string and do the following:
1. Change the 'opsys' column to 'os' using the ```.replace()``` method
2. Replace any spaces between words with an underscore (_).
3. Remove any ( or ) characters.

The function should return a string. Use your function in the list comprehension and assign the new values to the dataframe column headers. 
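
A minimal sketch of one way such a function could look, assuming the names have already been lowercased and stripped. This is one possible version, not the only valid solution.

```python
def cleaner(string):
    """Return a cleaned column name (one possible version of the exercise)."""
    string = string.replace("opsys", "os")             # 1. rename opsys to os
    string = string.replace(" ", "_")                  # 2. spaces between words become _
    string = string.replace("(", "").replace(")", "")  # 3. drop parentheses
    return string

print(cleaner("opsys"))              # → os
print(cleaner("price(euros)"))       # → priceeuros
print(cleaner("screen resolution"))  # → screen_resolution
```

Because ```.replace()``` matches substrings, the first step would also alter any other name that happens to contain "opsys".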

In [12]:
def cleaner(string):
    #Your Code Here
    return string

In [13]:
#Run when your function is ready
#Add your function here.
laptops.columns = [cleaner(col) for col in laptops.columns]
laptops.columns

Index(['company', 'product', 'type-name', 'inches', 'screen-resolution', 'cpu',
       'ram', 'memory', 'gpu', 'os', 'weight', 'priceeuros'],
      dtype='object')

#### Modify Column Data

Inspect the 'ram' column. Right now this column isn't useful because we have numeric and text data mixed. How do you think we can make it useful as a numeric column?

1. Use the ```Series.unique()``` method and identify the unique values in the column. Print out the result.
2. Use what you learned in the above cell to call ```Series.str.replace()``` and remove the text from the 'ram' column.
3. Assign the result back to the laptops 'ram' column.
4. Use ```.info()``` and inspect the column's datatype. What do you notice?
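
The steps above can be sketched as follows. The sample values are hypothetical and assume every entry ends in the same unit text; the real column should be checked with ```.unique()``` first.

```python
import pandas as pd

# Hypothetical values in the style of the 'ram' column
sample = pd.DataFrame({"ram": ["8GB", "16GB", "4GB", "8GB"]})

print(sample["ram"].unique())   # every value here ends in 'GB'

# Remove the unit text; regex=False treats 'GB' as a literal string
sample["ram"] = sample["ram"].str.replace("GB", "", regex=False)

sample.info()                   # 'ram' is still a text column, not a numeric one
```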


In [24]:
#Your Code Here

We changed the 'ram' column so that it contains only numeric characters. Notice that when we called ```.info()``` the column was still an object datatype. The values are still stored as strings, so we need to convert the column to a numeric type.

1. Use ```.astype()``` to convert ```laptops['ram']``` to an appropriate datatype
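
A short sketch of the conversion, on a hypothetical column whose unit text has already been removed. An integer type is assumed here because the sample values are whole numbers.

```python
import pandas as pd

# Hypothetical 'ram' column after the unit text has been removed
sample = pd.DataFrame({"ram": ["8", "16", "4"]})

# Whole gigabytes, so an integer type is a reasonable fit
sample["ram"] = sample["ram"].astype(int)
print(sample["ram"].dtype)
```

If any value still contained non-digit characters, ```.astype(int)``` would raise a ValueError, which is a useful check that the previous cleaning step was complete.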

In [10]:
#Your Code Here

#### Extracting and Mapping Data

1. Run value counts on the 'gpu' column. How many different GPUs are there?
2. The example code reduces the 'gpu' column to a handful of manufacturers and saves the result in a new column.
3. Investigate the 'cpu' column. 
    - Can we reduce the number of manufacturers like we did for 'gpu'? 
    - Save the result in a new column under the name 'cpu_manufacturer'.
    - How many cpu manufacturers are there?
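
A minimal sketch of the same split-and-take-first-word idea applied to a 'cpu'-style column. The sample strings are hypothetical, and the approach assumes the manufacturer is always the first word.

```python
import pandas as pd

# Hypothetical values in the style of the 'cpu' column
sample = pd.DataFrame({
    "cpu": ["Intel Core i5 2.3GHz", "Intel Core i7 2.7GHz", "AMD A9-Series 9420 3GHz"],
})

# The first whitespace-separated token is taken as the manufacturer
sample["cpu_manufacturer"] = sample["cpu"].str.split().str[0]

counts = sample["cpu_manufacturer"].value_counts()
n_manufacturers = sample["cpu_manufacturer"].nunique()  # number of distinct manufacturers
print(counts)
print(n_manufacturers)
```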

In [17]:
#Your Code Here

Intel HD Graphics 620      281
Intel HD Graphics 520      185
Intel UHD Graphics 620      68
Nvidia GeForce GTX 1050     66
Nvidia GeForce GTX 1060     48
                          ... 
Intel HD Graphics 530        1
AMD Radeon R7 M360           1
AMD Radeon R9 M385           1
AMD Radeon R5 520            1
Nvidia GeForce 960M          1
Name: gpu, Length: 110, dtype: int64

In [18]:
# Keep the first word of each GPU name as the manufacturer
laptops["gpu_manufacturer"] = (
    laptops["gpu"]
    .str.split()
    .str[0]
)
#Your Code Here


Intel      1240
AMD          62
Samsung       1
Name: cpu_manufacturer, dtype: int64


We can change elements within a series or dataframe with a map. A map is a dictionary that tells the function which values to change, and what they should be changed to.

1. Use ```.value_counts()``` to inspect the values in the 'os' column
2. Define a map that standardizes the entries. For example, Windows 7 and Windows 10 S would both become windows
3. Use the method ```Series.map()``` with the dictionary to change the values in the 'os' column. Reassign your result back to the 'os' column.
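
A sketch of the mapping step on a hypothetical sample. The dictionary `os_map` and its target labels are made up for illustration. One behavior worth knowing: ```Series.map()``` returns NaN for any value that has no key in the dictionary, so a partial map can silently introduce missing values.

```python
import pandas as pd

# Hypothetical sample using a few of the 'os' values
sample = pd.DataFrame({"os": ["Windows 10", "Windows 7", "macOS", "Mac OS X", "No OS"]})

# Hypothetical mapping; every value that should survive needs a key,
# because Series.map() returns NaN for values missing from the dictionary
os_map = {
    "Windows 10": "windows",
    "Windows 7": "windows",
    "macOS": "macos",
    "Mac OS X": "macos",
}

sample["os"] = sample["os"].map(os_map)
print(sample["os"])   # 'No OS' has no key in os_map, so it becomes NaN
```

Listing every distinct value from ```.value_counts()``` as a key is one way to avoid unintended NaN entries.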

In [14]:
#Your Code Here

Windows 10      1072
No OS             66
Linux             62
Windows 7         45
Chrome OS         27
macOS             13
Mac OS X           8
Windows 10 S       8
Android            2
Name: os, dtype: int64

In [19]:
#Your Code Here


laptops['os']

1       macOS
2       macOS
3       No OS
4       macOS
5       macOS
        ...  
1316      NaN
1317      NaN
1318      NaN
1319      NaN
1320      NaN
Name: os, Length: 1303, dtype: object

#### Dropping missing values
1. We can inspect the number of NaN values in each column with the .isnull() method.
    - How many missing values are there in each column?
    - Are there any columns with only NaN values?
    - What percentage of values is missing in each column?
2. Use the .dropna() method to drop any rows with missing values.
3. On a copy of the laptops dataframe, drop any columns that have missing values with .dropna().
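
These steps might look like the following on a small hypothetical frame. The percentage is computed with the mean of the boolean mask, since the mean of True/False values is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in one column
sample = pd.DataFrame({
    "company": ["Apple", "HP", "Acer", "Dell"],
    "os": ["macos", np.nan, np.nan, "windows"],
})

missing_counts = sample.isnull().sum()      # missing values per column
missing_pct = sample.isnull().mean() * 100  # percentage missing per column
print(missing_counts)
print(missing_pct)

rows_dropped = sample.dropna()               # drops rows containing any NaN
cols_dropped = sample.copy().dropna(axis=1)  # on a copy, drops columns containing any NaN
```

```.dropna()``` returns a new dataframe by default, so `sample` itself is unchanged unless the result is assigned back.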

In [25]:
#Your Code Here

In [21]:
#Your Code Here

company               0.000000
product               0.000000
type-name             0.000000
inches                0.000000
screen-resolution     0.000000
cpu                   0.000000
ram                   0.000000
memory                0.000000
gpu                   0.000000
os                   82.271681
weight                0.000000
priceeuros            0.000000
gpu_manufacturer      0.000000
cpu_manufacturer      0.000000
dtype: float64

In [17]:
#Your Code Here

#### Binning: Discrete to categorical
1. Investigate the 'inches' column with .value_counts()
    - Is the information useful in its current form?
    - If no, how can we make this information more useful?
2. Use pd.cut() to construct three bins from the 'inches' column
3. Assign the result to a new column 'screen_size'.
4. Call .value_counts() on the new column.
    - Is this more useful to categorize laptops by screen size?
    - How useful/not useful is doing something like this?
    - When would this make sense to do?
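
A minimal sketch of the binning step on hypothetical screen sizes. Passing an integer to `bins` asks ```pd.cut()``` for equal-width intervals over the observed range. The label names are an assumption for illustration.

```python
import pandas as pd

# Hypothetical screen sizes in inches
sample = pd.DataFrame({"inches": [11.6, 13.3, 14.0, 15.6, 17.3, 15.6]})

# bins=3 splits the observed range into three equal-width intervals;
# the label names are hypothetical
sample["screen_size"] = pd.cut(
    sample["inches"], bins=3, labels=["small", "medium", "large"]
)
print(sample["screen_size"].value_counts())
```

Equal-width bins depend on the minimum and maximum in the data, so a single extreme value can shift every boundary. Passing explicit bin edges as a list is an alternative when the cut points should be fixed.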

In [None]:
#Your Code Here

#### Conversions and exporting
1. Convert the values in the weight column to numeric values.
2. Rename the weight column to weight_kg.
3. Use the .to_csv() method to save the laptops dataframe to a CSV file named laptops_cleaned.csv without index labels.
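
The three steps can be sketched as below, on hypothetical weights that all end in "kg". The real column may contain other variants, so inspecting it with ```.unique()``` first is advisable. To keep the sketch free of side effects, ```.to_csv()``` is called without a path, which returns the CSV text as a string; passing a filename such as "laptops_cleaned.csv" as the first argument writes a file instead.

```python
import pandas as pd

# Hypothetical values in the style of the 'weight' column
sample = pd.DataFrame({"weight": ["1.37kg", "2.1kg", "1.86kg"]})

# 1. Strip the unit text and convert to float
sample["weight"] = sample["weight"].str.replace("kg", "", regex=False).astype(float)

# 2. Rename the column so the unit lives in the header
sample = sample.rename(columns={"weight": "weight_kg"})

# 3. index=False leaves the index labels out of the output
csv_text = sample.to_csv(index=False)
print(csv_text)
```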

In [None]:
#Your Code Here