## Project: Data Cleaning with Pandas

#### Below is a series of user stories, each followed by a Python code block
* These user stories walk through importing, cleaning, and exporting the included `dirty_cars_dataset.csv` file
* Read each user story carefully, then test and debug your code to confirm the story is completed correctly!


### As a Data Analyst, I want to set up the proper imports so I have access to the Pandas library

In [1]:
import pandas as pd

### As a Data Analyst, I want to import and store the `dirty_cars_dataset.csv` file in a variable
✅ I want to use the `index` column from this .csv file as the `index column` of my DataFrame

In [2]:
dirty_cars_df = pd.read_csv("dirty_cars_dataset.csv")
print(dirty_cars_df)

    index      company   body-style  wheel-base  length engine-type  \
0       0  alfa-romero  convertible        88.6   168.8        dohc   
1       1  alfa-romero  convertible        88.6   168.8        dohc   
2       2  alfa-romero    hatchback        94.5   171.2        ohcv   
3       3         audi        sedan        99.8   176.6         ohc   
4       4         audi        sedan        99.4   176.6         ohc   
..    ...          ...          ...         ...     ...         ...   
59     82   volkswagen        sedan        97.3   171.7         ohc   
60     83   volkswagen        sedan        97.3   171.7         ohc   
61     86   volkswagen        sedan        97.3   171.7         ohc   
62     87        volvo        sedan       104.3   188.8         ohc   
63     88        volvo        wagon       104.3   188.8         ohc   

   num-of-cylinders  horsepower  average-mileage      price  
0              four         111               21    13495.0  
1              four    
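
The user story above asks for the file's `index` column to become the DataFrame index. A minimal sketch of how `read_csv` can do that through its `index_col` parameter is shown here. The three-row CSV text and the name `sample_df` are made up for illustration and stand in for `dirty_cars_dataset.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for dirty_cars_dataset.csv (values are illustrative).
sample_csv = io.StringIO(
    "index,company,price\n"
    "0,alfa-romero,13495.0\n"
    "1,alfa-romero,16500.0\n"
    "2,audi,13950.0\n"
)

# index_col tells read_csv to use the named column as the row labels
# instead of generating a default 0..n-1 RangeIndex.
sample_df = pd.read_csv(sample_csv, index_col="index")

print(sample_df.index.name)     # name carried over from the CSV header
print(list(sample_df.columns))  # "index" is no longer a regular column
```

With this option the row labels come from the file rather than from row position, so any later label-based operation (such as `drop(labels=...)`) would refer to the file's index values.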

### As a Data Analyst, I want to view the **information** about my new DataFrame to answer the following questions:
##### Enter your responses in the Markdown block below

* How many entries are in this DataFrame: 
* How many columns are in this DataFrame:
* Which column(s) contain null values in this DataFrame:

This DataFrame contains 64 rows and 10 columns.
The price column appears to contain null values in this DataFrame.
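
The three questions above can also be answered programmatically rather than by reading the `info()` printout. The sketch below assumes a tiny made-up DataFrame (`sample_df`) in place of the real data; `shape` gives the row and column counts, and `isnull().sum()` gives a per-column count of missing values.

```python
import pandas as pd

# Hypothetical data: one missing price, no missing company names.
sample_df = pd.DataFrame({
    "company": ["audi", "bmw", "volvo"],
    "price": [13950.0, float("nan"), 12940.0],
})

n_rows, n_cols = sample_df.shape        # (entries, columns)
null_counts = sample_df.isnull().sum()  # missing values per column

print(n_rows, n_cols)
print(null_counts[null_counts > 0])     # only the columns that contain nulls
```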

### As a Data Analyst, I want to remove any null values from the DataFrame
✅ I want to **create a new DataFrame variable** when I remove these null values<br>
✅ Then, I want to display the **information** about my new DataFrame, to confirm the null values were successfully removed

In [13]:
nonnull_dirty_cars_df = dirty_cars_df.dropna()
nonnull_dirty_cars_df.info()
# print(nonnull_dirty_cars_df)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 0 to 63
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             61 non-null     int64  
 1   company           61 non-null     object 
 2   body-style        61 non-null     object 
 3   wheel-base        61 non-null     float64
 4   length            61 non-null     float64
 5   engine-type       61 non-null     object 
 6   num-of-cylinders  61 non-null     object 
 7   horsepower        61 non-null     int64  
 8   average-mileage   61 non-null     int64  
 9   price             61 non-null     float64
dtypes: float64(3), int64(3), object(4)
memory usage: 5.2+ KB


### As a Data Analyst, I want to check if there are any **duplicate rows** within my DataFrame

In [14]:
find_dupes_df = nonnull_dirty_cars_df.duplicated()
print(find_dupes_df.to_string())


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
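
Scanning a long column of `False` values is error-prone, so a count is often easier to read. The sketch below uses a made-up three-row DataFrame to show `duplicated().sum()`. It also illustrates one caveat: `duplicated()` compares every column, so if an identifier column such as `index` is kept as a regular column and differs between otherwise identical rows, no duplicates are reported. Whether that applies to this dataset is an assumption and has not been verified here.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical except for the "index" column.
sample_df = pd.DataFrame({
    "index": [0, 1, 2],
    "company": ["audi", "audi", "volvo"],
    "price": [13950.0, 13950.0, 12940.0],
})

# Comparing all columns: the unique "index" values hide the repeated row.
all_cols_mask = sample_df.duplicated()

# Comparing only the data columns: the repeated row is flagged.
data_cols = [col for col in sample_df.columns if col != "index"]
data_cols_mask = sample_df.duplicated(subset=data_cols)

print(all_cols_mask.sum())   # duplicates found when "index" is included
print(data_cols_mask.sum())  # duplicates found when "index" is excluded
```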


### As a Data Analyst, I want to **remove** any duplicate values from the DataFrame
✅ I want to **create a new DataFrame variable** when I remove these duplicate values<br>
✅ I want to again check if there are any duplicate rows within my DataFrame, to ensure the values were removed successfully

In [38]:
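dupes_removed_df = nonnull_dirty_cars_df.drop_duplicates()
# print(dupes_removed_df)
# Re-check for duplicates: prints False when no duplicate rows remain
print(dupes_removed_df.duplicated().any())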

### As a Data Analyst, I want to ensure I remove any outlier values from my DataFrame to avoid inaccurate analysis of my data
✅ I want to **create a new DataFrame variable** when I remove these values<br><br>
💡 **Hint:** These inaccuracies will be within the `price` column 💡<br><br>
💡 **Hint:** There will be both **high** and **low** outlier values 💡

In [37]:
# Sort by price so the extreme values sit at the top and bottom of the frame
find_outliers_df = dupes_removed_df.sort_values("price")
# print(find_outliers_df.to_string())

# Drop the two outlier rows by index label: 2 (high), then 42 (low)
high_outlier_removed = find_outliers_df.drop(labels=2)
all_outliers_removed = high_outlier_removed.drop(labels=42)

# print(all_outliers_removed.to_string())
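
Dropping rows by hard-coded label works once the outliers have been spotted by eye, but it breaks if the row labels change. As an alternative, and not the method used in the cell above, the sketch below applies the common 1.5 × IQR rule to a made-up `price` column. The data, the variable names, and the 1.5 multiplier are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical prices with one implausibly low and one implausibly high value.
prices_df = pd.DataFrame(
    {"price": [5.0, 6000.0, 7000.0, 8000.0, 9000.0, 10000.0, 500000.0]}
)

# Interquartile range of the price column.
q1 = prices_df["price"].quantile(0.25)
q3 = prices_df["price"].quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the rows whose price falls inside the fences (new DataFrame variable).
no_outliers_df = prices_df[prices_df["price"].between(lower_fence, upper_fence)]

print(lower_fence, upper_fence)
print(no_outliers_df)
```

A rule like this should still be checked against the sorted data, since a fixed multiplier can flag legitimate luxury models or miss subtle errors.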

### As a Data Analyst, I want to reformat the **company** series, ensuring all company name values are properly title (Pascal) cased

In [36]:
# renamed_df = all_outliers_removed.rename(columns={ "company": "Company"})
# print(renamed_df.to_string())

all_outliers_removed["company"] = all_outliers_removed["company"].str.title()
# all_outliers_removed.to_string()
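
To make the effect of `str.title()` concrete, here is a small sketch on a made-up Series of company names. It capitalizes the first letter of each word and lowercases the rest, treating a hyphen as a word boundary and keeping it in place. That differs from strict Pascal case, which would also remove the separators.

```python
import pandas as pd

# Hypothetical company names in mixed casing.
companies = pd.Series(["alfa-romero", "AUDI", "mercedes-benz", "bmw"])

# Vectorized title-casing; hyphens are kept and each part is capitalized.
titled = companies.str.title()

print(titled.tolist())  # ['Alfa-Romero', 'Audi', 'Mercedes-Benz', 'Bmw']
```

Note that acronyms such as `bmw` become `Bmw`, so brand names that should stay fully uppercase may need a manual correction afterwards.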

### As a Data Analyst, I want to create a ***new*** column on my DataFrame to represent the **price of each car in Euros**
💡 **Use the conversion rate 1.05 USD == 1 Euro** 💡

In [35]:
# 1.05 USD == 1 Euro, so a USD price is converted to Euros by dividing by 1.05
all_outliers_removed["price in euros"] = all_outliers_removed["price"] / 1.05
# all_outliers_removed
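
Because the stated rate is 1.05 USD per 1 Euro, a USD amount converts to Euros by dividing by 1.05 (105 USD is 100 Euros). The sketch below works through that arithmetic on two made-up prices. The constant name `USD_PER_EUR` and the rounding to two decimals are illustrative choices, not part of the assignment.

```python
import pandas as pd

USD_PER_EUR = 1.05  # conversion rate given in the user story

# Hypothetical USD prices.
sample_df = pd.DataFrame({"price": [105.0, 13495.0]})

# euros = dollars / (dollars per euro); rounded to cents for readability.
sample_df["price in euros"] = (sample_df["price"] / USD_PER_EUR).round(2)

print(sample_df)
```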

### As a Data Analyst, I want to rename the existing **price** column to show that it represents **price in USD**

In [34]:
updated_df = all_outliers_removed.rename(columns={"price": "price in USD"})
# updated_df

### As a Data Analyst, I want to output my cleaned DataFrame as a .csv file
✅ I want to name this file `cleaned_cars_dataset.csv`<br>
✅ I want to specify the encoding type 'utf-8'<br>
✅ I want to include this .csv file in my GitHub repository

In [39]:
# Write the cleaned DataFrame to cleaned_cars_dataset.csv with UTF-8 encoding
updated_df.to_csv("cleaned_cars_dataset.csv", encoding="utf-8")
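
By default `to_csv` also writes the DataFrame's row labels as an extra first column. The sketch below shows one way to check an export by writing a made-up DataFrame to a temporary file with `index=False` and reading it back. The file name `sample_cleaned.csv` and the data are hypothetical, and whether to drop the row labels depends on whether they carry meaning in your dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
sample_df = pd.DataFrame({
    "company": ["Audi", "Volvo"],
    "price in USD": [13950.0, 12940.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample_cleaned.csv")

    # index=False omits the row labels, so no unnamed first column is written.
    sample_df.to_csv(out_path, encoding="utf-8", index=False)

    # Read the file back to confirm the columns and values survived the export.
    round_trip_df = pd.read_csv(out_path, encoding="utf-8")

print(list(round_trip_df.columns))
print(round_trip_df)
```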