# Pandas and NumPy Fundamentals
## Data Cleaning Basics

Our dataset is ready for some analysis, but there are still some data cleaning tasks left! Here are your next steps:

- Convert the price_euros column to a numeric dtype.
- Extract the screen resolution from the screen column.
- Extract the processor speed from the cpu column.

Here are some questions you might like to answer in your own time by analyzing the cleaned data:

- Are laptops made by Apple more expensive than those made by other manufacturers?
- What is the best value laptop with a screen size of 15" or more?
- Which laptop has the most storage space?

The final mission in our course is a guided project, where we'll put everything together to clean and analyze a dataset using pandas!

### Reading CSV Files with Encodings
1. Import the pandas library
2. Use the pandas.read_csv() function to read the laptops.csv file into a dataframe laptops.
    - Specify the encoding using the string "Latin-1".
3. Use the DataFrame.info() method to display information about the laptops dataframe.

In [1]:
import pandas as pd

laptops = pd.read_csv("../laptops.csv", encoding="Latin-1")
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Manufacturer              1303 non-null   object
 1   Model Name                1303 non-null   object
 2   Category                  1303 non-null   object
 3   Screen Size               1303 non-null   object
 4   Screen                    1303 non-null   object
 5   CPU                       1303 non-null   object
 6   RAM                       1303 non-null   object
 7    Storage                  1303 non-null   object
 8   GPU                       1303 non-null   object
 9   Operating System          1303 non-null   object
 10  Operating System Version  1133 non-null   object
 11  Weight                    1303 non-null   object
 12  Price (Euros)             1303 non-null   object
dtypes: object(13)
memory usage: 132.5+ KB
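
To illustrate why the encoding argument matters, here is a minimal sketch using a made-up two-row sample (not the real laptops.csv) held in memory. The sample bytes and the name "Café Tech" are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample encoded as Latin-1: the "é" is stored as the single byte 0xE9.
raw = "Manufacturer,Model Name\nAcer,Aspire 3\nCafé Tech,Book 1\n".encode("latin-1")

# Decoding the same bytes as UTF-8 fails, because 0xE9 is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding lets pandas decode the bytes correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="Latin-1")
print(utf8_ok)
print(sample["Manufacturer"].tolist())
```

If a file raises a UnicodeDecodeError with the default encoding, trying "Latin-1" (as done above) is one common next step.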


### Cleaning Column Names
1. Remove any whitespace from the start and end of each column name.
    - Create an empty list named new_columns.
    - Use a for loop to iterate through each column name using the DataFrame.columns attribute. Inside the body of the for loop:
        - Use the str.strip() method to remove whitespace from the start and end of the string.
        - Append the updated column name to the new_columns list.
    - Assign the updated column names to the DataFrame.columns attribute.

In [3]:
new_columns = []

for column_name in laptops.columns:
    new_columns.append(column_name.strip())
    
laptops.columns = new_columns
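
As an alternative to the loop, pandas exposes vectorized string methods on the column index. The sketch below uses a small hypothetical dataframe (`demo`) rather than the laptops data.

```python
import pandas as pd

# Hypothetical frame with stray whitespace in one header, like " Storage" above.
demo = pd.DataFrame({"Manufacturer": ["Acer"], " Storage": ["16GB SSD"]})

# Index.str.strip() applies str.strip() to every column label at once.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```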

### Cleaning Column Names Continued
1. Define a function, which accepts a string argument, and:
    - Removes any whitespace from the start and end of the string.
    - Replaces the substring Operating System with the abbreviation os.
    - Replaces all spaces with underscores.
    - Removes parentheses from the string.
    - Makes the entire string lowercase.
    - Returns the modified string.
2. Use a loop to apply the function to each item in the DataFrame.columns attribute for the laptops dataframe. Assign the result back to the DataFrame.columns attribute.

In [4]:
new_column = []

for column_name in laptops.columns:
    if column_name.startswith("Operating System"):
        column_name = column_name.replace("Operating System", "os")        
    
    new_column.append(column_name.strip().replace("(", "").replace(")", "").replace(" ", "_").lower())
        
laptops.columns = new_column
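
The same steps can also be packaged as a function, as the task describes. This is one possible sketch; the function name `clean_col` and the sample list `raw_columns` are assumptions chosen for illustration.

```python
def clean_col(col):
    """Return a cleaned, snake_case version of a column label."""
    col = col.strip()                            # trim leading/trailing whitespace
    col = col.replace("Operating System", "os")  # abbreviate
    col = col.replace(" ", "_")                  # spaces to underscores
    col = col.replace("(", "").replace(")", "")  # drop parentheses
    return col.lower()

# Hypothetical sample of the original column labels.
raw_columns = [" Storage", "Operating System Version", "Price (Euros)"]
cleaned = [clean_col(c) for c in raw_columns]
print(cleaned)
```

With the real dataframe, the equivalent assignment would be `laptops.columns = [clean_col(c) for c in laptops.columns]`.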

### Converting String Columns to Numeric
1. Use the Series.unique() method to identify the unique values in the ram column of the laptops dataframe. Assign the result to unique_ram.
2. After running your code, use the variable inspector to view the unique values in the ram column and identify any patterns.

In [5]:
unique_ram = laptops["ram"].unique()

### Removing Non-Digit Characters
1. Use the Series.str.replace() method to remove the substring GB from the ram column.
2. Use the Series.unique() method to assign the unique values in the ram column to unique_ram.
3. After running your code, use the variable inspector to verify your changes.

In [6]:
laptops["ram"] = laptops["ram"].str.replace("GB", "")
unique_ram = laptops["ram"].unique()

### Converting Columns to Numeric Dtypes
1. Use the Series.astype() method to change the ram column to an integer dtype.
2. Use the DataFrame.dtypes attribute to get a list of the column names and types from the laptops dataframe. Assign the result to dtypes.
3. After running your code, use the variable inspector to view the dtypes variable to see the results of your code.

In [7]:
laptops["ram"] = laptops["ram"].astype(int)
dtypes = laptops.dtypes
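
The three steps above (inspect, strip the non-digit characters, convert) can be tried on a small stand-alone series. The values below are hypothetical, and `regex=False` is passed so that the pattern is treated as a literal string.

```python
import pandas as pd

# Hypothetical RAM values in the same "<number>GB" pattern.
ram = pd.Series(["8GB", "16GB", "4GB"])

# Remove the unit, then convert the remaining digits to integers.
ram_int = ram.str.replace("GB", "", regex=False).astype(int)
print(ram_int.tolist())
print(ram_int.dtype)
```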

### Renaming Columns
1. Because the GB characters contained useful information about the units (gigabytes) of the laptop's ram, use the DataFrame.rename() method to rename the column from ram to ram_gb.
2. Use the Series.describe() method to return a series of descriptive statistics for the ram_gb column. Assign the result to ram_gb_desc.
3. After you have run your code, use the variable inspector to see the results of your code.

In [8]:
laptops.rename({"ram":"ram_gb"}, axis=1, inplace=True)
ram_gb_desc = laptops["ram_gb"].describe()
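
A minimal sketch of the same rename-then-describe pattern on a hypothetical frame. It uses the `columns=` keyword and reassignment instead of `axis=1` with `inplace=True`; both forms rename the column.

```python
import pandas as pd

# Hypothetical numeric RAM column.
demo = pd.DataFrame({"ram": [4, 8, 16]})

# Rename by passing a mapping of old name to new name.
demo = demo.rename(columns={"ram": "ram_gb"})

# describe() returns count, mean, std, min, quartiles and max as a series.
desc = demo["ram_gb"].describe()
print(desc)
```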

### Extracting Values from Strings
In the example code, we have extracted the manufacturer name from the gpu column, and assigned it to a new column gpu_manufacturer.

1. Extract the manufacturer name from the cpu column. Assign it to a new column cpu_manufacturer.
2. Use the Series.value_counts() method to find the counts of each manufacturer in cpu_manufacturer. Assign the result to cpu_manufacturer_counts.

In [9]:
laptops["gpu_manufacturer"] = (laptops["gpu"]
                                       .str.split()
                                       .str[0]
                              )
laptops["cpu_manufacturer"] = (laptops["cpu"]
                               .str.split()
                               .str[0]
                              )
cpu_manufacturer_counts = laptops["cpu_manufacturer"].value_counts()
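
To see what each chained step produces, the sketch below applies the same split-and-take-first-word pattern to a few hypothetical CPU strings.

```python
import pandas as pd

# Hypothetical CPU descriptions; the first word is the manufacturer.
cpu = pd.Series([
    "Intel Core i5 2.3GHz",
    "AMD A9-Series 9420 3GHz",
    "Intel Core i7 2.7GHz",
])

# str.split() yields a list of words per row; str[0] picks the first word.
cpu_manufacturer = cpu.str.split().str[0]
counts = cpu_manufacturer.value_counts()
print(counts)
```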

### Correcting Bad Values
We have created a dictionary for you to use with mapping. Note that we have included both the correct and incorrect spelling of macOS as keys. Without the correct spelling as a key, values that are already correct would have no match and would become null.

1. Use the Series.map() method with the mapping_dict dictionary to correct the values in the os column.

In [10]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

laptops["os"] = laptops["os"].map(mapping_dict)
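
The following sketch shows why every existing value needs a key: `Series.map()` returns NaN for any value missing from the dictionary. The series and the two dictionaries are hypothetical.

```python
import pandas as pd

os_col = pd.Series(["Mac OS", "macOS", "Windows"])

# Mapping without the already-correct spelling as a key.
partial = {"Mac OS": "macOS", "Windows": "Windows"}
# Mapping that also maps the correct spelling to itself.
complete = {"Mac OS": "macOS", "macOS": "macOS", "Windows": "Windows"}

with_gap = os_col.map(partial)   # "macOS" has no key, so it becomes NaN
fixed = os_col.map(complete)     # every value has a key
print(with_gap.isnull().tolist())
print(fixed.tolist())
```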

### Dropping Missing Values
1. Use DataFrame.dropna() to remove any rows from the laptops dataframe that have null values. Assign the result to laptops_no_null_rows.
2. Use DataFrame.dropna() to remove any columns from the laptops dataframe that have null values. Assign the result to laptops_no_null_cols.

In [11]:
laptops_no_null_rows = laptops.dropna()
laptops_no_null_cols = laptops.dropna(axis=1)
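
A small hypothetical frame makes the difference between the two calls easier to see: the default drops rows, while `axis=1` drops columns.

```python
import pandas as pd

# Hypothetical frame with one missing os_version.
demo = pd.DataFrame({
    "os": ["Windows", "No OS", "macOS"],
    "os_version": ["10", None, "X"],
})

no_null_rows = demo.dropna()        # drops the row with the missing value
no_null_cols = demo.dropna(axis=1)  # drops the column with the missing value
print(no_null_rows.shape)
print(no_null_cols.shape)
```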

### Filling Missing Values
1. Use a boolean array to identify rows that have the value No OS for the os column. Then, use assignment to assign the value Version Unknown to the os_version column for those rows.
2. Use the syntax below to create the value_counts_after variable:

`value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()`

3. After running your code, use the variable inspector to look at the difference between value_counts_before and value_counts_after.

In [12]:
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"
value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()
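
The boolean-mask assignment can be sketched on a hypothetical frame as follows, counting the nulls before and after to confirm the effect.

```python
import pandas as pd

# Hypothetical frame where "No OS" rows have no version recorded.
demo = pd.DataFrame({
    "os": ["No OS", "Windows", "No OS"],
    "os_version": [None, "10", None],
})

nulls_before = int(demo["os_version"].isnull().sum())

# The boolean series selects rows; the column label selects the target column.
demo.loc[demo["os"] == "No OS", "os_version"] = "Version Unknown"

nulls_after = int(demo["os_version"].isnull().sum())
print(nulls_before, nulls_after)
```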

### Challenge: Clean a String Column
1. Convert the values in the weight column to numeric values.
2. Rename the weight column to weight_kg.
3. Use the DataFrame.to_csv() method to save the laptops dataframe to a CSV file laptops_cleaned.csv without index labels.

In [13]:
laptops["weight"] = laptops["weight"].str.replace("kgs", "").str.replace("kg", "").astype(float)
laptops.rename({"weight":"weight_kg"}, axis=1, inplace=True)
laptops.to_csv("laptops_cleaned.csv", index=False)
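
The order of the two replacements matters: removing "kg" first would leave a stray "s" on values ending in "kgs". A sketch with hypothetical weights is shown below; calling `to_csv()` without a path returns the CSV text, which makes it easy to check that no index column is written.

```python
import pandas as pd

# Hypothetical weights using both unit spellings.
demo = pd.DataFrame({"weight": ["1.37kg", "4kgs", "2.2kg"]})

demo["weight"] = (demo["weight"]
                  .str.replace("kgs", "", regex=False)  # longer suffix first
                  .str.replace("kg", "", regex=False)
                  .astype(float))
demo = demo.rename(columns={"weight": "weight_kg"})

# With no path argument, to_csv() returns the CSV content as a string.
csv_text = demo.to_csv(index=False)
print(csv_text)
```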

### Next Steps

#### Convert the price_euros column to a numeric dtype.

In [14]:
laptops["price_euros"] = laptops["price_euros"].str.replace(",", ".").astype(float)
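
The prices use a comma as the decimal separator, so the comma is swapped for a period before converting. A minimal sketch with hypothetical price strings:

```python
import pandas as pd

# Hypothetical prices written with a decimal comma.
price = pd.Series(["1339,69", "898,94", "575,00"])

# Replace the decimal comma with a period, then convert to float.
price_float = price.str.replace(",", ".", regex=False).astype(float)
print(price_float.tolist())
```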

#### Extract the screen resolution from the screen column.

In [15]:
laptops["screen_resolution"] = laptops["screen"].str.split().str[-1]
laptops["screen"] = laptops["screen"].str.split().str[:-1].str.join(" ")
laptops.rename({"screen":"screen_specs"}, axis=1, inplace=True)
laptops["screen_size"] = laptops["screen_size"].str.replace("\"", "").astype(float)

In [16]:
laptops.loc[laptops["screen_specs"].str.split().str.len() == 0, "screen_specs"] = "No Information"
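
The sketch below walks through the same idea on hypothetical screen strings: the last word is taken as the resolution, the remaining words are joined back together, and rows with nothing left are labeled "No Information".

```python
import pandas as pd

# Hypothetical screen descriptions; the resolution is always the last word.
screen = pd.Series([
    "IPS Panel Retina Display 2560x1600",
    "Full HD 1920x1080",
    "1366x768",
])

parts = screen.str.split()
resolution = parts.str[-1]            # last word
specs = parts.str[:-1].str.join(" ")  # everything before it, rejoined

# Rows that held only a resolution are left with an empty string.
specs.loc[specs == ""] = "No Information"
print(resolution.tolist())
print(specs.tolist())
```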

#### Extract the processor speed from the cpu column.

In [17]:
laptops["cpu_ghz_frequency"] = laptops["cpu"].str.split().str[-1]
laptops["cpu_ghz_frequency"] = laptops["cpu_ghz_frequency"].str.replace("GHz", "").astype(float)

In [18]:
laptops["cpu"] = laptops["cpu"].str.split().str[:-1].str.join(" ")
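
The same split pattern applies to the processor speed. This sketch assumes, as the cells above do, that the speed is always the last word and always ends in "GHz"; the CPU strings are hypothetical.

```python
import pandas as pd

cpu = pd.Series(["Intel Core i5 2.3GHz", "AMD A9-Series 9420 3GHz"])

parts = cpu.str.split()
# Last word holds the speed: strip the unit and convert to float.
ghz = parts.str[-1].str.replace("GHz", "", regex=False).astype(float)
# Remaining words describe the processor model.
model = parts.str[:-1].str.join(" ")
print(ghz.tolist())
print(model.tolist())
```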

### Analyzing Cleaned Data

#### Are laptops made by Apple more expensive than those made by other manufacturers?

In [None]:
mean_price = {}
for brand in laptops["manufacturer"].unique():
    # Select the rows that belong to this manufacturer.
    brand_rows = laptops[laptops["manufacturer"] == brand]
    mean_price[brand] = brand_rows["price_euros"].mean()

most_expensive = max(mean_price, key=mean_price.get)
least_expensive = min(mean_price, key=mean_price.get)
print(f'The most expensive brand in the dataset is: "{most_expensive}", while the least expensive is: "{least_expensive}"')

The most expensive brand in the dataset is: "Razer", while the least expensive is: "Vero"
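
A groupby can compute the per-manufacturer means without an explicit loop, and a two-way label compares Apple directly against everyone else. The sketch below uses entirely hypothetical prices, so its numbers say nothing about the real dataset; the `maker_group` column is an illustrative name.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
demo = pd.DataFrame({
    "manufacturer": ["Apple", "Apple", "Acer", "Acer", "Razer"],
    "price_euros": [1500.0, 1300.0, 400.0, 600.0, 3000.0],
})

# Mean price per manufacturer, then the labels of the extremes.
mean_price = demo.groupby("manufacturer")["price_euros"].mean()
most_expensive = mean_price.idxmax()
least_expensive = mean_price.idxmin()

# Keep "Apple" as-is and relabel every other manufacturer as "Other".
demo["maker_group"] = demo["manufacturer"].where(demo["manufacturer"] == "Apple", "Other")
group_means = demo.groupby("maker_group")["price_euros"].mean()
print(most_expensive, least_expensive)
print(group_means)
```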


#### What is the best value laptop with a screen size of 15" or more?

In [48]:
laptops[laptops["screen_size"] >= 15.0].sort_values(by="price_euros").iloc[1]

manufacturer                                  Acer
model_name                           Chromebook 15
category                                  Notebook
screen_size                                   15.6
screen_specs                        No Information
cpu                  Intel Celeron Dual Core 3205U
ram_gb                                           4
storage                                   16GB SSD
gpu                              Intel HD Graphics
os                                       Chrome OS
os_version                                     NaN
weight_kg                                      2.2
price_euros                                    209
gpu_manufacturer                             Intel
cpu_manufacturer                             Intel
screen_resolution                         1366x768
cpu_ghz_frequency                              1.5
Name: 1102, dtype: object
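
The filter-then-sort pattern can be sketched on hypothetical data as follows. Treating the lowest price as the measure of "best value" is an assumption here. Note that positional indexing with `iloc` is zero-based, so `iloc[0]` is the first row of the sorted frame.

```python
import pandas as pd

# Hypothetical laptops, for illustration only.
demo = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "screen_size": [13.3, 15.6, 17.3, 15.6],
    "price_euros": [199.0, 450.0, 380.0, 209.0],
})

# Keep only large screens, sort by price ascending, take the first row.
large = demo[demo["screen_size"] >= 15.0]
cheapest = large.sort_values(by="price_euros").iloc[0]
print(cheapest["model_name"], cheapest["price_euros"])
```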

#### Which laptop has the most storage space?

In [62]:
laptops[laptops["storage"].str.contains("1TB HDD +", regex=False)].sort_values(by="storage", ascending=False)

| | manufacturer | model_name | category | screen_size | screen_specs | cpu | ram_gb | storage | gpu | os | os_version | weight_kg | price_euros | gpu_manufacturer | cpu_manufacturer | screen_resolution | cpu_ghz_frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 703 | Lenovo | V310-15IKB (i5-7200U/4GB/1TB/FHD/W10) | Notebook | 15.6 | Full HD | Intel Core i5 7200U | 4 | 1TB HDD + 1TB HDD | Intel HD Graphics 620 | Windows | 10 | 2.1 | 621.45 | Intel | Intel | 1920x1080 | 2.5 |
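
Sorting the storage column as text compares strings rather than capacities. One alternative, sketched below as an assumption and not as the approach used above, is to parse each storage string into a total number of gigabytes. The helper `total_storage_gb` is hypothetical, and it assumes 1TB = 1000GB and that every drive is written as a number followed by GB or TB.

```python
import re
import pandas as pd

def total_storage_gb(text):
    """Sum every '<number>GB' or '<number>TB' found in the string, in GB."""
    total = 0.0
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(GB|TB)", text):
        total += float(amount) * (1000 if unit == "TB" else 1)
    return total

# Hypothetical storage descriptions in the dataset's style.
storage = pd.Series(["16GB SSD", "128GB SSD + 1TB HDD", "1TB HDD + 1TB HDD"])
storage_gb = storage.apply(total_storage_gb)

# idxmax() gives the index label of the largest total.
largest = storage.loc[storage_gb.idxmax()]
print(storage_gb.tolist())
print(largest)
```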
