<a href="https://colab.research.google.com/github/Twikam218/car_price/blob/main/P_133.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Instructions

#### Goal of the Project

This project is designed for you to practice and solve the activities that are based on the concepts covered in the lesson:

**Movie Recommender System**

---

#### Getting Started:

1. Follow the next 3 steps to create a copy of this Colab file and start working on the project.

2. Create a duplicate copy of the Colab file as described below.

  - Click on the **File menu**. A new drop-down list will appear.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/0_file_menu.png' width=500>

  - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab in your web browser.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/1_create_colab_duplicate_copy.png' width=500>

3. After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_Project133** format.

4. Now, write your code in the prescribed code cells.

---

#### Problem Statement

Data cleaning is one of the most important steps in data science. On a real project, you can expect to spend a large share of your time (a commonly quoted figure is about 80%) cleaning the data.

In this project, you are given a dummy dataset of car details. Perform different operations (removing duplicate values, converting data types, etc.) to clean the data.

---

### Dataset Description

The DataFrame consists of the following columns:

|Field|Description|
|---:|:---|
| Brand | Brand of the car |
| Model | List of models (stored as a list-type string) |
| Year | Year of manufacture |
| Price | Price of the models (in USD) |


---

### List of Activities

**Activity 1:** Import Modules and Read Data

**Activity 2:** Convert List-Type Strings into List

**Activity 3:** Remove Duplicate Values

**Activity 4:** Explode the DataFrame

---

#### Activity 1: Import Modules and Read Data

1. Import the necessary Python modules.

2. Read the data from the dummy variable to create a Pandas DataFrame and go through the necessary data-cleaning process (if required).


In [None]:
# Import the Python modules and the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Dummy Dataset
cars = {'Brand': ['Ford', 'Toyota', 'BMW', 'Ford'],
        'Model': ["['Endeavour','EcoSport','Figo','Aspire','FreeStyle']",
                  "['Camry','Fortuner','Vellfire','Innova','Glanza']",
                  "['Sedan','Gran Turismo','Gran Coupe','Roadster','iX']",
                  "['Endeavour','EcoSport','Figo','Aspire','FreeStyle']"],
        'Year': [2003, 2004, 2008, 2009],
        'Price': [26000, 21000, 35000, None]}

# Make a DataFrame from the Dummy Dataset
car_df = pd.DataFrame(cars)
print(car_df)

    Brand                                              Model  Year    Price
0    Ford  ['Endeavour','EcoSport','Figo','Aspire','FreeS...  2003  26000.0
1  Toyota  ['Camry','Fortuner','Vellfire','Innova','Glanza']  2004  21000.0
2     BMW  ['Sedan','Gran Turismo','Gran Coupe','Roadster...  2008  35000.0
3    Ford  ['Endeavour','EcoSport','Figo','Aspire','FreeS...  2009      NaN


Get the information on the dataset.

In [None]:
# Get the total number of rows and columns, data types of columns, and
# missing values (if exist) in the dataset.
print(car_df.info())
print(car_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Brand   4 non-null      object 
 1   Model   4 non-null      object 
 2   Year    4 non-null      int64  
 3   Price   3 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes
None
Brand    0
Model    0
Year     0
Price    1
dtype: int64


**Q:** Are there any missing values in the dataset?

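**A:** Yes. The `Price` column has one missing value.

**Q:** Are there any non-numeric columns?

**A:** Yes. The `Brand` and `Model` columns have the `object` data type.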
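
Both answers can also be read off programmatically rather than by eye. The following is a minimal sketch, assuming a small stand-in DataFrame named `demo_df` (a hypothetical name, with values copied from the dummy dataset and the `Model` column left out for brevity):

```python
import pandas as pd

# Hypothetical stand-in for the project's DataFrame (values from the dummy dataset).
demo_df = pd.DataFrame({
    "Brand": ["Ford", "Toyota", "BMW", "Ford"],
    "Year": [2003, 2004, 2008, 2009],
    "Price": [26000, 21000, 35000, None],
})

# Names of the columns that contain at least one missing value.
missing_counts = demo_df.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Names of the columns whose dtype is not numeric.
non_numeric_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print(cols_with_missing)
print(non_numeric_cols)
```

On the full project DataFrame, the second list would be expected to include `Model` as well.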

---

#### Activity 2: Convert List-Type Strings into List

Convert the `Model` column, which contains list-type strings, into actual lists using `literal_eval()` and a lambda function.

Follow the steps given below:
1. Use the `apply()` function on the `car_df['Model']` column and pass a lambda function as input to `apply()`. This lambda function must perform the following task in a single line:
 - Parse (unstring) the list of model names in each row using the `literal_eval()` function. The lambda function returns this new list.

2. Store the lists returned by the lambda function as the `'Model'` column in the DataFrame. This will replace the existing `'Model'` column with the values returned by the lambda function.

3. Print the first 5 rows of the DataFrame to verify whether the `'Model'` column contains the modified values or not.
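
As a rough illustration of what the steps above do to a single cell, the sketch below parses one list-type string and then applies the same conversion to a small Series. The names `raw_value`, `parsed_value`, `demo_models`, and `parsed_models` are hypothetical and are not part of the project code.

```python
from ast import literal_eval

import pandas as pd

# One cell as stored in the raw 'Model' column: a string that only looks like a list.
raw_value = "['Camry','Fortuner','Vellfire','Innova','Glanza']"
parsed_value = literal_eval(raw_value)

print(type(raw_value).__name__)     # the raw cell is a str
print(type(parsed_value).__name__)  # the parsed cell is a list

# The same conversion applied element-wise to a hypothetical demo Series.
demo_models = pd.Series([raw_value, "['Sedan','Roadster']"])
parsed_models = demo_models.apply(lambda x: literal_eval(x))
print(parsed_models.iloc[1])
```

`literal_eval()` only evaluates Python literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice than `eval()` for this kind of parsing.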

In [None]:
# Use lambda function and 'literal_eval()' to obtain a list of models.
from ast import literal_eval
car_df['Model'] = car_df['Model'].apply(lambda x: literal_eval(x))

# Print the first 5 rows to verify the modified 'Model' column.
car_df.head()

Also, check the data type of the first row of `car_df['Model']` to verify whether it is a list or not.

In [None]:
# Check the data type of the first row of column 'Model'.
# Note: type(car_df['Model']) would return the type of the whole column (a pandas Series).
type(car_df['Model'].iloc[0])

**Q:** What is the type of the first row of the `Model` column?

**A:** `list`



---



#### Activity 3: Remove Duplicate Values

Identify and display the duplicate `'Brand'` entries in the DataFrame using the `duplicated()` function.

In [None]:
# Identify and display the duplicate entries in the car_df DataFrame
car_df[car_df.duplicated('Brand', keep=False)].sort_values(by='Brand')


| | Brand | Model | Year | Price |
|---:|:---|:---|---:|---:|
| 0 | Ford | [Endeavour, EcoSport, Figo, Aspire, FreeStyle] | 2003 | 26000.0 |
| 3 | Ford | [Endeavour, EcoSport, Figo, Aspire, FreeStyle] | 2009 | NaN |
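
The `keep` argument decides which of the repeated entries get flagged. Here is a small sketch on a hypothetical one-column DataFrame (`demo_df` is an illustrative name) that compares the three options:

```python
import pandas as pd

# Hypothetical one-column frame with the same brands as the dummy dataset.
demo_df = pd.DataFrame({"Brand": ["Ford", "Toyota", "BMW", "Ford"]})

# keep='first' (the default): every repeat after the first occurrence is flagged.
first_mask = demo_df.duplicated("Brand").tolist()

# keep='last': every repeat before the last occurrence is flagged.
last_mask = demo_df.duplicated("Brand", keep="last").tolist()

# keep=False: all members of a duplicated group are flagged.
all_mask = demo_df.duplicated("Brand", keep=False).tolist()

print(first_mask)  # [False, False, False, True]
print(last_mask)   # [True, False, False, False]
print(all_mask)    # [True, False, False, True]
```

This is why the cell above passes `keep=False`: it is the only option that displays both `Ford` rows side by side.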


Drop all the duplicate rows using the steps given below:
1. Call the `drop_duplicates()` function of the DataFrame and pass `subset = ['Brand']` as input.

2. Verify the removal of duplicate entries based on the `'Brand'` field using `duplicated()` function.
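
Which duplicate survives matters for the later steps. The sketch below, which uses a hypothetical `demo_df` built from the dummy values, suggests how `keep='first'` (the default) and `keep='last'` differ: only the second option retains the `Ford` row with the missing price.

```python
import pandas as pd

# Hypothetical stand-in for the project's DataFrame (values from the dummy dataset).
demo_df = pd.DataFrame({
    "Brand": ["Ford", "Toyota", "BMW", "Ford"],
    "Year": [2003, 2004, 2008, 2009],
    "Price": [26000, 21000, 35000, None],
})

# Default behaviour: keep the first row seen for each brand.
kept_first = demo_df.drop_duplicates(subset=["Brand"])

# Alternative: keep the last row seen for each brand.
kept_last = demo_df.drop_duplicates(subset=["Brand"], keep="last")

print(kept_first.index.tolist(), kept_first["Price"].isnull().sum())
print(kept_last.index.tolist(), kept_last["Price"].isnull().sum())

# Verification: no brand is flagged as a duplicate any more.
print(kept_first.duplicated("Brand").any())
```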

In [None]:
# Drop the rows with a duplicate 'Brand' value (the first occurrence is kept)
subset = car_df.drop_duplicates(subset="Brand")

# Verify removal of duplicated entries on the 'Brand' column
subset[subset.duplicated("Brand")]

| | Brand | Model | Year | Price |
|---:|:---|:---|---:|---:|


Convert the data type of the `Price` column into `int` using the `astype()` function.

In [None]:
# Convert the data type of 'Price' column into 'int'
subset['Price'] = subset['Price'].astype('int')

# Get the total number of rows and columns and data-types of columns
subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Brand   3 non-null      object
 1   Model   3 non-null      object
 2   Year    3 non-null      int64 
 3   Price   3 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 120.0+ bytes


**Q:** After converting the data type of a column using `astype`, will values get converted too?

**A:** Yes. `astype()` converts the stored values along with the dtype, so `26000.0` becomes `26000`.
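
The integer cast above succeeds only because the row with the missing price was dropped together with the duplicate brand. The following sketch, on a hypothetical `prices` Series, shows what would likely happen if the missing value were still present, and one possible alternative using pandas' nullable `Int64` dtype (not something the project itself asks for):

```python
import pandas as pd

# Hypothetical price column that still contains the missing value.
prices = pd.Series([26000.0, 21000.0, 35000.0, None])

# After removing the missing value, astype('int') converts the values themselves.
clean_int = prices.dropna().astype("int")
print(clean_int.tolist())

# With the missing value present, the plain integer cast raises an error.
try:
    prices.astype("int")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# The nullable 'Int64' dtype stores integers and keeps the gap as <NA>.
nullable_int = prices.astype("Int64")
print(nullable_int.dtype, nullable_int.isna().sum())
```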



---



#### Activity 4: Explode the DataFrame

The elements in the `'Model'` column of the DataFrame are lists. Unpack (expand) each list of model names so that every row contains only one model name rather than a list of names, using the `explode()` function of the pandas DataFrame.

In [None]:
# Explode 'Model' column and give new index to every row by passing 'ignore_index = True' to 'explode()' function.
ex_df=subset.explode('Model', ignore_index = True)
ex_df.head()

| | Brand | Model | Year | Price |
|---:|:---|:---|---:|---:|
| 0 | Ford | Endeavour | 2003 | 26000 |
| 1 | Ford | EcoSport | 2003 | 26000 |
| 2 | Ford | Figo | 2003 | 26000 |
| 3 | Ford | Aspire | 2003 | 26000 |
| 4 | Ford | FreeStyle | 2003 | 26000 |


**Q:** What will happen if `ignore_index` is set to false?

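**A:** Each exploded row keeps the index label of the row it came from, so the index contains repeated labels (`0, 0, 0, 0, 0, 1, ...`) instead of a fresh range starting at `0`.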
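
The effect of `ignore_index` is easiest to see on the index itself. A minimal sketch, assuming a hypothetical two-row `demo_df` with shortened model lists:

```python
import pandas as pd

# Hypothetical frame whose 'Model' column already holds real lists.
demo_df = pd.DataFrame({
    "Brand": ["Ford", "BMW"],
    "Model": [["Figo", "Aspire"], ["Sedan", "iX", "Roadster"]],
})

# ignore_index=False (the default): each new row keeps its source row's label.
kept_index = demo_df.explode("Model")

# ignore_index=True: the result is relabelled 0, 1, ..., n - 1.
new_index = demo_df.explode("Model", ignore_index=True)

print(kept_index.index.tolist())  # [0, 0, 1, 1, 1]
print(new_index.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated labels are not an error, but label-based lookups such as `kept_index.loc[0]` then return several rows, which is one reason to relabel.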

---

### Submitting the Project:

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, make sure that '**Anyone on the Internet with this link can view**' option is selected and then click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>

3. The link to the duplicate copy of the notebook (named **YYYY-MM-DD_StudentName_Project133**) will be copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.
   
   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_Project133** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>

---