### Phase 2 - Cleaning the Data
### Steps
1. Check the total records and get a bird's-eye view of the data
2. Check for duplicate records (entire rows)
3. Drop rows based on a condition (optional; be careful, because deleting entire rows can affect the accuracy of the analysis)
4. Check for duplicate values in each column
5. Rename the columns so that each name suits the nature of its data
6. Check for null / empty cells
7. Find missing values such as NA, None, NaN, NaT
8. Change the column data types
9. Delete unnecessary columns

In [2]:
# import Pandas Library / Package
import pandas as pd


In [3]:
# We are using our own generated, realistic fake restaurant data to practice this phase
df = pd.read_csv('../Data-Analysis-Beginner/CSV Files/restaurant_customers_fake_data_copy.csv')

### 1. Check Total Records and bird's eye view observation

In [4]:
# 1. Check all records using the info() method
# It reports the total number of records and columns, plus the data type and non-null count of each column
df.info()
## PRIMARY ANALYSIS / OBSERVATION
    # 1. We have a total of 10 records in this copy of the file
    # 2. The Name, Email and PhoneNumber columns do not have any empty cells
    # 3. The Visit Date, Total Bill, Payment Method, Rating and Feedback columns have empty cells (Feedback is filled in only 4 of 10 rows), which is not a good sign for the analysis

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            10 non-null     object 
 1   Email           10 non-null     object 
 2   PhoneNumber     10 non-null     object 
 3   Visit Date      7 non-null      object 
 4   Total Bill      6 non-null      float64
 5   Payment Method  7 non-null      object 
 6   Rating          9 non-null      float64
 7   Feedback        4 non-null      object 
dtypes: float64(2), object(6)
memory usage: 772.0+ bytes


In [5]:
# View the data types
df.dtypes
## PRIMARY ANALYSIS / OBSERVATION
    # 1. Most columns are imported as object; only Total Bill and Rating are float64
    # 2. We need to change the data type of each column to match its meaning
    # 3. Name -> String, Email -> String, Phone Number -> Int, Visit Date -> Date, Total Bill -> Float
    # 4. Payment Method -> String, Rating -> Float and Feedback -> String

Name               object
Email              object
PhoneNumber        object
Visit Date         object
Total Bill        float64
Payment Method     object
Rating            float64
Feedback           object
dtype: object

In [6]:
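## Count the unique values in each column (a first hint about duplicates)
df.nunique()
## PRIMARY ANALYSIS / OBSERVATION
    # 1. nunique() counts distinct non-null values per column; it does not count duplicate rows
    # 2. Name, Email and PhoneNumber each have 10 unique values for 10 rows, so no row can be an exact duplicate in this sample
    # 3. On a larger file, fewer unique values than rows in such identifying columns would hint at duplicate rows,
    # which we should inspect and delete if necessary, because an entirely duplicated row adds no information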


Name              10
Email             10
PhoneNumber       10
Visit Date         7
Total Bill         6
Payment Method     3
Rating             8
Feedback           4
dtype: int64
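
The difference between per-column unique counts and full-row duplicates is easy to mix up, so here is a minimal sketch on a made-up four-row frame. The `toy` data and the variable names are assumptions for illustration only; they are not taken from the restaurant file.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact repeat of row 0
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Cy"],
    "Rating": [4.0, 3.5, 4.0, None],
})

# Distinct non-null values in each column (NaN is not counted)
unique_per_column = toy.nunique()

# Number of rows that repeat an earlier row in every column
duplicate_row_count = int(toy.duplicated().sum())

print(unique_per_column)
print("duplicate rows:", duplicate_row_count)
```

A column can contain repeated values without any row being a duplicate, which is why `duplicated()` is the more direct check for step 2.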

### 2. Check Total Duplicate Rows from the record

In [7]:
# duplicated() method
# Marks each row that is an exact repeat of an earlier row
df.duplicated().head(10)
## PRIMARY ANALYSIS / OBSERVATION
    # 1. Every value in the output is False, so this sample has no fully duplicated rows
    # 2. When duplicates do appear, observe them carefully before dropping them from the record
    # 3. If they are legitimate duplicates that add nothing to the analysis, we can drop them

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [8]:
# drop_duplicates() method
# Helps us drop the duplicate rows
df.drop_duplicates(keep='first',inplace=True)
## PRIMARY ANALYSIS / OBSERVATION
    # 1. keep='first' keeps the first occurrence of each duplicated row and drops the rest; this sample had none, so all 10 rows remain
    # 2. inplace=True modifies the DataFrame in memory; the CSV file on disk is not changed

df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
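
Because the sample above contains no duplicates, the effect of `drop_duplicates()` is hard to see there. The sketch below uses a small invented frame (`toy` is hypothetical) to show which rows are flagged and what remains after dropping them.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "Payment Method": ["Cash", "Cash", "Debit Card", "Cash"],
    "TotalBill": [20.0, 20.0, 35.5, 20.0],
})

# keep=False marks every member of a duplicate group, which helps with inspection
dup_mask = toy.duplicated(keep=False)

# keep='first' retains the first occurrence; without inplace=True a new frame is returned
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(toy[dup_mask])
print(deduped)
```

Returning a new DataFrame instead of using `inplace=True` leaves the original available for comparison, which can be handy while you are still deciding what to drop.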

### 3. Drop the columns that will not contribute to the analysis

In [9]:
# Before dropping the columns,
# let's use the nunique() method to check the unique values of each column
df.nunique()
## PRIMARY ANALYSIS / OBSERVATION
    # 1. PhoneNumber is unique for every row, so we are going to drop that column
    # 2. We are also going to drop the Email and Name columns from this record
    # 3. We are deleting these columns for practice; depending on the nature of your analysis, you should make your own decision for each column


Name              10
Email             10
PhoneNumber       10
Visit Date         7
Total Bill         6
Payment Method     3
Rating             8
Feedback           4
dtype: int64

In [10]:
# drop() method
# Let's drop those columns; errors='raise' raises a KeyError if any listed column does not exist
df.drop(columns=['Name','Email','PhoneNumber'],errors='raise',inplace=True)
# The three columns above should no longer be listed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Visit Date      7 non-null      object 
 1   Total Bill      6 non-null      float64
 2   Payment Method  7 non-null      object 
 3   Rating          9 non-null      float64
 4   Feedback        4 non-null      object 
dtypes: float64(2), object(3)
memory usage: 532.0+ bytes


### 4. Renaming the column 

In [11]:
# rename() method
# This method helps us rename columns
# inplace=True applies the change to df directly instead of returning a new DataFrame
updateColName = {
    'Visit Date':'VisitDate',
    'Total Bill' : 'TotalBill',
    }
df.rename(columns=updateColName,inplace=True)
df.info()
# We can see that Visit Date is renamed to VisitDate and Total Bill to TotalBill

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   VisitDate       7 non-null      object 
 1   TotalBill       6 non-null      float64
 2   Payment Method  7 non-null      object 
 3   Rating          9 non-null      float64
 4   Feedback        4 non-null      object 
dtypes: float64(2), object(3)
memory usage: 532.0+ bytes


### 5. Check the Null / Empty cells


In [12]:
# isnull() method
# We can use isnull() with .sum() to count the null values in each column
# (only the last expression in a cell is displayed, so wrap this line in print() to see the counts)
df.isnull().sum()
## PRIMARY ANALYSIS / OBSERVATION
    # 1. The Name, Email and PhoneNumber columns had no null cells (they were dropped in step 3)
    # 2. However, the remaining columns have many missing values; TotalBill is missing 40% and Feedback 60% of its values
df.head(10)

Unnamed: 0,VisitDate,TotalBill,Payment Method,Rating,Feedback
0,2024-06-28,301.87,,,
1,,,Credit Card,2.609698,Actually decision remain.
2,2024-01-07,309.56,Cash,3.103861,Outside store tree whole yourself movement war...
3,,273.49,,1.914464,
4,2024-06-27,493.98,Debit Card,2.011962,
5,2024-05-10,,Cash,6.0,
6,2024-03-04,,,1.100436,
7,,175.82,Cash,3.868506,Agree whom some road which other despite.
8,2024-04-27,,Debit Card,6.0,Attorney ground claim should.
9,2024-07-04,357.96,Debit Card,1.789233,
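
To put numbers on "how much is missing", the null counts can be turned into percentages. This is a minimal sketch on invented data; `toy`, `missing_count` and `missing_pct` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "VisitDate": ["2024-06-28", None, "2024-01-07", None],
    "TotalBill": [301.87, np.nan, 309.56, 273.49],
    "Feedback": [None, "ok", None, None],
})

# Number of missing cells per column
missing_count = toy.isna().sum()

# The mean of a boolean mask is the fraction of True values, so this is the share missing
missing_pct = (toy.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```

Percentages make it easier to compare columns and to decide whether a column is worth filling, worth dropping, or fine as it is.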


### 6. Dropping either Rows / Columns with null/NaN

In [34]:
# Assign the result to a new object so that df itself is left unchanged
    ## PRIMARY ANALYSIS / OBSERVATION
        # 1. Be extremely cautious when dropping rows or columns
        # 2. Dropping either of them without properly understanding the requirements and objective of the analysis
        # 3. can reduce its accuracy, which ultimately affects the decisions based on it.

newDrdop = df.dropna(axis='columns',thresh=10,ignore_index=True) # Keep only the columns that have at least 10 non-null values
# Overwrite newDrdop: keep only the rows where Feedback is not null
newDrdop = df.dropna(axis=0,ignore_index=True,subset=['Feedback'],thresh=1)
newDrdop.head(10)
# Export it to the CSV file format
# newDrdop.to_csv('row-delete.csv',index=True)

Unnamed: 0,VisitDate,TotalBill,Payment Method,Rating,Feedback
0,,23.0,Credit Card,2.609698,Actually decision remain.
1,2024-01-07,309.56,Cash,3.103861,Outside store tree whole yourself movement war...
2,,175.82,Cash,3.868506,Agree whom some road which other despite.
3,2024-04-27,23.0,Debit Card,6.0,Attorney ground claim should.
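
The `thresh` and `subset` parameters of `dropna()` are easy to misread, so the sketch below walks through three common variants on a tiny invented frame. The data and variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: row 1 is completely empty
toy = pd.DataFrame({
    "VisitDate": ["2024-06-28", None, None],
    "TotalBill": [301.87, None, 273.49],
    "Feedback": [None, None, "Good"],
})

# thresh=2 keeps rows that have at least 2 non-null values
rows_kept = toy.dropna(axis=0, thresh=2)

# subset limits the null check to the listed columns
with_feedback = toy.dropna(subset=["Feedback"])

# With axis='columns', thresh applies to the non-null count of each column
cols_kept = toy.dropna(axis="columns", thresh=2)

print(rows_kept.index.tolist())
print(with_feedback.index.tolist())
print(cols_kept.columns.tolist())
```

Note that `thresh` counts the values that are present, not the values that are missing, which is the opposite of what many readers expect at first.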


### 6. Filling the NaN/ Null Values

In [14]:
# Fill the NaN values in the VisitDate column with the placeholder '2024-23-1'
# Note: '2024-23-1' is not a valid date (there is no month 23), so use it only as a visible placeholder
newFill = df.fillna(value = {"VisitDate":'2024-23-1'}) # Fill a specific column
# newdfill = df.interpolate(method='linear')
# newFill = df.fillna('4',limit=3)  # Fill NaN in all the columns
newFill.head(10)

Unnamed: 0,VisitDate,TotalBill,Payment Method,Rating,Feedback
0,2024-06-28,301.87,,,
1,2024-23-1,,Credit Card,2.609698,Actually decision remain.
2,2024-01-07,309.56,Cash,3.103861,Outside store tree whole yourself movement war...
3,2024-23-1,273.49,,1.914464,
4,2024-06-27,493.98,Debit Card,2.011962,
5,2024-05-10,,Cash,6.0,
6,2024-03-04,,,1.100436,
7,2024-23-1,175.82,Cash,3.868506,Agree whom some road which other despite.
8,2024-04-27,,Debit Card,6.0,Attorney ground claim should.
9,2024-07-04,357.96,Debit Card,1.789233,
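
A single constant is rarely the best fill value for every column. As a hedged alternative, the sketch below fills a numeric column with its median and a text column with a label. The toy data and the choice of median and `"Unknown"` are assumptions for illustration, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "TotalBill": [100.0, np.nan, 300.0],
    "Payment Method": ["Cash", None, "Debit Card"],
})

# A dict maps each column to its own fill value; columns not listed are left alone
filled = toy.fillna({
    "TotalBill": toy["TotalBill"].median(),
    "Payment Method": "Unknown",
})

print(filled)
```

`fillna()` returns a new DataFrame here, so `toy` still contains its original gaps and the two can be compared side by side.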


### 7. Change the Data Type - Object -> Datetime

In [15]:
# read_csv() often imports date columns as object (string) by default.
# Therefore, we need to convert such columns to the right data type

# Note: this makes changeType another name for df, not a copy, so changes apply to both
changeType = df
# Convert Object to Datetime and assign the result back to the column
changeType['VisitDate'] = pd.to_datetime(changeType['VisitDate'])
# changeType.dtypes
changeType.head(4)


# changeType.duplicated()

Unnamed: 0,VisitDate,TotalBill,Payment Method,Rating,Feedback
0,2024-06-28,301.87,,,
1,,,Credit Card,2.609698,Actually decision remain.
2,2024-01-07,309.56,Cash,3.103861,Outside store tree whole yourself movement war...
3,,273.49,,1.914464,
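
When a date column may contain values that cannot be parsed, `pd.to_datetime` can be told to turn them into `NaT` instead of raising an error. The sketch below assumes the dates are meant to follow the `YYYY-MM-DD` layout seen in the output above; the `raw` series is invented for illustration.

```python
import pandas as pd

# Hypothetical column: one valid date, one missing value, one impossible date (month 23)
raw = pd.Series(["2024-06-28", None, "2024-23-1"])

# errors='coerce' converts anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.dtype)
```

After conversion, the invalid entries can be found with `parsed.isna()` and handled like any other missing value.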


### 8. Replacing the Invalid values in Column

In [16]:
# Import the NumPy package to refer to NaN
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
changeType.TotalBill = changeType.TotalBill.replace(to_replace=np.nan, value=23)
changeType.head(4)
# Replaces NaN in the TotalBill column with 23.0




Unnamed: 0,VisitDate,TotalBill,Payment Method,Rating,Feedback
0,2024-06-28,301.87,,,
1,,23.0,Credit Card,2.609698,Actually decision remain.
2,2024-01-07,309.56,Cash,3.103861,Outside store tree whole yourself movement war...
3,,273.49,,1.914464,
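
Invalid values do not always arrive as NaN; a numeric column may also contain text markers. As a sketch under that assumption (the markers `"N/A"` and `"-"` are invented and do not appear in the restaurant file), `pd.to_numeric` with `errors='coerce'` converts them to NaN so they can then be replaced like any other missing value.

```python
import pandas as pd

# Hypothetical bill column that was read as text because of placeholder markers
raw_bill = pd.Series(["301.87", "N/A", "-", "175.82"])

# Anything that cannot be parsed as a number becomes NaN
as_number = pd.to_numeric(raw_bill, errors="coerce")

# The NaN cells can now be replaced, here with the same constant used above
cleaned = as_number.fillna(23)

print(cleaned)
```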


## 1 . Exercise & Challenges 
### Presidential Election Poll

In [17]:
exercise = pd.read_csv('../Data-Analysis-Beginner/CSV Files/president_general_polls_2016.csv')
# Preview the first 6 rows
exercise.head(6)


Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
0,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,ABC News/Washington Post,A+,...,45.20163,41.7243,4.626221,,,https://www.washingtonpost.com/news/the-fix/wp...,48630,76192,11/7/16,09:35:33 8 Nov 2016
1,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/1/2016,11/7/2016,Google Consumer Surveys,B,...,43.34557,41.21439,5.175792,,,https://datastudio.google.com/u/0/#/org//repor...,48847,76443,11/7/16,09:35:33 8 Nov 2016
2,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/2/2016,11/6/2016,Ipsos,A-,...,42.02638,38.8162,6.844734,,,http://projects.fivethirtyeight.com/polls/2016...,48922,76636,11/8/16,09:35:33 8 Nov 2016
3,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/4/2016,11/7/2016,YouGov,B,...,45.65676,40.92004,6.069454,,,https://d25d2506sfb94s.cloudfront.net/cumulus_...,48687,76262,11/7/16,09:35:33 8 Nov 2016
4,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Gravis Marketing,B-,...,46.84089,42.33184,3.726098,,,http://www.gravispolls.com/2016/11/final-natio...,48848,76444,11/7/16,09:35:33 8 Nov 2016
5,2016,President,polls-plus,Clinton vs. Trump vs. Johnson,11/8/16,U.S.,11/3/2016,11/6/2016,Fox News/Anderson Robbins Research/Shaw & Comp...,A,...,49.02208,43.95631,3.057876,,,http://www.foxnews.com/politics/2016/11/07/fox...,48619,76163,11/7/16,09:35:33 8 Nov 2016


In [18]:
# nunique() method
# Columns with only 1 unique value: cycle, branch, matchup, forecastdate, multiversions
exercise.nunique()

cycle                   1
branch                  1
type                    3
matchup                 1
forecastdate            1
state                  57
startdate             352
enddate               345
pollster              196
grade                  10
samplesize           1767
population              4
poll_wt              4399
rawpoll_clinton      1312
rawpoll_trump        1385
rawpoll_johnson       584
rawpoll_mcmullin       16
adjpoll_clinton     12569
adjpoll_trump       12582
adjpoll_johnson      6629
adjpoll_mcmullin       57
multiversions           1
url                  1304
poll_id              4208
question_id          4208
createddate           222
timestamp               3
dtype: int64

In [31]:
# Keep only the rows whose type is now-cast (all other rows are dropped)
exercise = exercise.query('type == "now-cast"')
exercise.head(3)
# exercise.info()

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016
4223,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Iowa,11/1/2016,11/4/2016,Selzer & Company,A+,...,39.36898,45.67372,5.995712,,,http://www.desmoinesregister.com/story/news/po...,48470,75957,11/5/16,09:24:53 8 Nov 2016


In [32]:
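# Dropping all the columns with a single unique value (left commented out, so the output below still shows them)
# drop() returns None when inplace=True, so inplace is not combined with the assignment here
# exercise = exercise.drop(columns=['branch','type','matchup','forecastdate','multiversions','timestamp'],errors='raise')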
exercise.head(3)

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016
4223,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Iowa,11/1/2016,11/4/2016,Selzer & Company,A+,...,39.36898,45.67372,5.995712,,,http://www.desmoinesregister.com/story/news/po...,48470,75957,11/5/16,09:24:53 8 Nov 2016
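
Instead of typing the constant columns by hand, they can be derived from `nunique()`. This is a minimal sketch on invented data; `toy` only borrows a few column names from the poll file, and its values are made up.

```python
import pandas as pd

# Hypothetical toy data: 'cycle' is constant and 'multiversions' is entirely empty
toy = pd.DataFrame({
    "cycle": [2016, 2016, 2016],
    "state": ["Iowa", "Virginia", "Iowa"],
    "multiversions": [None, None, None],
})

# nunique() ignores NaN, so an all-empty column has 0 unique values
unique_counts = toy.nunique()

# Columns with at most one distinct value carry no information for comparison
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

trimmed = toy.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```

It is still worth reviewing the resulting list before dropping, because a column that is constant after filtering (such as `type` here) may matter for documenting what the data represents.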


In [21]:
# Drop all the rows where state is U.S.
exercise = exercise.query('state != "U.S."')
exercise.head(2)

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016


In [22]:
# Rename the columns; rename() returns a new DataFrame, so assign the result back
exercise = exercise.rename(columns={
    'rawpoll_clinton':'clinton_pct',
    'rawpoll_trump':'trump_pct'
})
exercise.head(2)

Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016


In [33]:
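# Convert the Data Type
# Datetime columns: startdate, enddate, createddate, forecastdate, timestamp
cols = ['startdate','enddate','createddate','timestamp']
# apply() returns a new DataFrame; the result is not assigned here, so exercise keeps its string columns
# To persist the conversion: exercise[cols] = exercise[cols].apply(pd.to_datetime)
exercise[cols].apply(pd.to_datetime)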
exercise.head(2)




Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016
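
The output above still shows the dates as text because the converted columns were never stored. The sketch below shows one way to persist the conversion on a filtered frame. The `polls` data is invented, and the `%m/%d/%Y` format is an assumption based on how `startdate` and `enddate` appear in the output.

```python
import pandas as pd

# Hypothetical toy data in the same month/day/year layout as the poll file
polls = pd.DataFrame({
    "state": ["U.S.", "Iowa", "Virginia"],
    "startdate": ["11/3/2016", "11/1/2016", "11/3/2016"],
    "enddate": ["11/6/2016", "11/4/2016", "11/4/2016"],
})

# .copy() makes the filtered frame independent, so later assignments are unambiguous
state_polls = polls.query('state != "U.S."').copy()

# Assign each converted column back; otherwise the conversion is discarded
for col in ["startdate", "enddate"]:
    state_polls[col] = pd.to_datetime(state_polls[col], format="%m/%d/%Y")

# Real datetime columns support date arithmetic
duration_days = (state_polls["enddate"] - state_polls["startdate"]).dt.days

print(state_polls.dtypes)
print(duration_days.tolist())
```

Giving an explicit `format` avoids format guessing and makes the parsing rule visible to the reader.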


In [24]:
# Convert State and Population to category type
exercise.state = exercise.state.astype('category')
exercise.population = exercise.population.astype('category')
#Export to CSV
exercise.to_csv('../Data-Analysis-Beginner/CSV Files/Exercise-1.csv',index=True)
exercise.head(2)



Unnamed: 0,cycle,branch,type,matchup,forecastdate,state,startdate,enddate,pollster,grade,...,adjpoll_clinton,adjpoll_trump,adjpoll_johnson,adjpoll_mcmullin,multiversions,url,poll_id,question_id,createddate,timestamp
4216,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,New Mexico,11/6/2016,11/6/2016,Zia Poll,,...,45.04927,41.92541,7.657972,,,http://projects.fivethirtyeight.com.s3.amazona...,48614,76158,11/7/16,09:24:53 8 Nov 2016
4221,2016,President,now-cast,Clinton vs. Trump vs. Johnson,11/8/16,Virginia,11/3/2016,11/4/2016,Public Policy Polling,B+,...,47.457,42.35281,2.199139,,,http://www.publicpolicypolling.com/pdf/2015/PP...,48349,75743,11/4/16,09:24:53 8 Nov 2016
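
To see what the `category` conversion does, the sketch below inspects a small invented series of state names. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical column with a few repeated labels
states = pd.Series(["Iowa", "Virginia", "Iowa", "Iowa"])

as_category = states.astype("category")

# Each distinct label is stored once ...
labels = as_category.cat.categories.tolist()

# ... and every row holds a small integer code that points to its label
codes = as_category.cat.codes.tolist()

print(labels)
print(codes)
```

Keep in mind that a CSV file stores only text, so the category type is not preserved by `to_csv()`; the column has to be converted again after the file is read back.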


## 2 . Exercise & Challenges 
### Car Data

In [25]:
# Read car files from the local folder
cars = pd.read_csv('../Data-Analysis-Beginner/CSV Files/cars.csv')
cars.head(5) # display the first 5 records

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [26]:
# Count the cars in each fueltype category (gas and diesel)
cars.fueltype.value_counts().head(4)

fueltype
gas       185
diesel     20
Name: count, dtype: int64

In [27]:
# Examining the carName column
cars['CarName'].unique()

array(['alfa-romero giulia', 'alfa-romero stelvio',
       'alfa-romero Quadrifoglio', 'audi 100 ls', 'audi 100ls',
       'audi fox', 'audi 5000', 'audi 4000', 'audi 5000s (diesel)',
       'bmw 320i', 'bmw x1', 'bmw x3', 'bmw z4', 'bmw x4', 'bmw x5',
       'chevrolet impala', 'chevrolet monte carlo', 'chevrolet vega 2300',
       'dodge rampage', 'dodge challenger se', 'dodge d200',
       'dodge monaco (sw)', 'dodge colt hardtop', 'dodge colt (sw)',
       'dodge coronet custom', 'dodge dart custom',
       'dodge coronet custom (sw)', 'honda civic', 'honda civic cvcc',
       'honda accord cvcc', 'honda accord lx', 'honda civic 1500 gl',
       'honda accord', 'honda civic 1300', 'honda prelude',
       'honda civic (auto)', 'isuzu MU-X', 'isuzu D-Max ',
       'isuzu D-Max V-Cross', 'jaguar xj', 'jaguar xf', 'jaguar xk',
       'maxda rx3', 'maxda glc deluxe', 'mazda rx2 coupe', 'mazda rx-4',
       'mazda glc deluxe', 'mazda 626', 'mazda glc', 'mazda rx-7 gs',
       'mazda glc 

In [28]:
# Renaming the column name of car record
# CarName -> brandname and car_ID -> carid

cars = cars.rename(
    columns={
        'CarName':'brandname',
        'car_ID':'carid'
    })
cars.head(5)

Unnamed: 0,carid,symboling,brandname,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [29]:
# Drop / delete the carid and symboling columns
cars = cars.drop(columns=['carid','symboling'],errors='ignore')
cars.head(5)

Unnamed: 0,brandname,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,carlength,carwidth,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,audi 100ls,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [30]:
# CarName column should be combination of carname and brandname
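
One possible reading of the note above is that each value holds a brand followed by a model name, as in `'audi 100 ls'`. Under that assumption, a minimal sketch of separating the two parts looks like this. The `names` series reuses three values from the `unique()` output, and the `brand` and `model` column names are hypothetical.

```python
import pandas as pd

# Three values taken from the CarName output above
names = pd.Series(["alfa-romero giulia", "audi 100 ls", "bmw x1"])

# Split on the first space only, so multi-word model names stay together
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["brand", "model"]

print(parts)
```

Misspelled brands in the data, such as `maxda`, would still need to be corrected separately, for example with `replace()`.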
