```What is categorical data?```

**Categorical data is data that can be classified into distinct categories or groups.
Examples include things like color (red, green, blue), size (small, medium, large),
or even days of the week (Monday, Tuesday, Wednesday).**

```Why one-hot encode?```
**Most machine learning algorithms work with numerical data.
One-hot encoding allows us to convert categorical data into a numerical 
representation that the algorithms can understand. It does this by creating 
a new binary feature (column) for each category in the original data.**

```How does it work?```

**For each categorical variable, one-hot encoding creates a new binary column
for every possible category. Each data point in the new binary columns 
is assigned a 1 if it belongs to that particular category, and a 0 otherwise.
Essentially, you end up with a new table where each row represents a data point,
and each column represents a possible category across all the original categorical variables.**
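The mechanism described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what pandas does internally; the helper name `one_hot_encode` and the sample `colors` list are made up for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    # For every data point, write 1 in the column of its category and 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.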

***Exercise***

This file contains car sell prices for 3 different models. First, plot the data points on a scatter plot to see whether a linear regression model can be applied. If yes, build a model that can answer the following questions:

1) Predict the price of a Mercedez Benz that is 4 years old with mileage 45000

2) Predict the price of a BMW X5 that is 7 years old with mileage 86000

3) Report the score of your model. (Hint: use LinearRegression().score(), which returns R², often loosely called accuracy)

In [1]:
import pandas as pd 
df = pd.read_csv('carprices.csv')
df

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


### We use pandas `get_dummies` to one-hot encode the categorical `Car Model` column

In [4]:
new_df = pd.get_dummies(df['Car Model'])
new_df

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,False,True,False
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
5,True,False,False
6,True,False,False
7,True,False,False
8,True,False,False
9,False,False,True


### Now concatenate this table with the original df

In [5]:
merged_df = pd.concat([df,new_df],axis='columns')
merged_df

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,False,True,False
1,BMW X5,35000,34000,3,False,True,False
2,BMW X5,57000,26100,5,False,True,False
3,BMW X5,22500,40000,2,False,True,False
4,BMW X5,46000,31500,4,False,True,False
5,Audi A5,59000,29400,5,True,False,False
6,Audi A5,52000,32000,5,True,False,False
7,Audi A5,72000,19300,6,True,False,False
8,Audi A5,91000,12000,8,True,False,False
9,Mercedez Benz C class,67000,22000,6,False,False,True


### Now that the categorical data is available as boolean columns, we drop the Car Model column

In [9]:
Final_df = merged_df.drop(['Car Model'],axis='columns')
Final_df

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5,Mercedez Benz C class
0,69000,18000,6,False,True,False
1,35000,34000,3,False,True,False
2,57000,26100,5,False,True,False
3,22500,40000,2,False,True,False
4,46000,31500,4,False,True,False
5,59000,29400,5,True,False,False
6,52000,32000,5,True,False,False
7,72000,19300,6,True,False,False
8,91000,12000,8,True,False,False
9,67000,22000,6,False,False,True


### Now, to avoid the dummy variable trap, drop any one of the dummy variable columns

In [10]:
Final_df = Final_df.drop(['Mercedez Benz C class'],axis='columns')
Final_df

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,False,True
1,35000,34000,3,False,True
2,57000,26100,5,False,True
3,22500,40000,2,False,True
4,46000,31500,4,False,True
5,59000,29400,5,True,False
6,52000,32000,5,True,False
7,72000,19300,6,True,False
8,91000,12000,8,True,False
9,67000,22000,6,False,False
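The reason one column can be dropped is that the dummy columns of a single variable always sum to 1 in every row, so any one of them can be derived from the others. As a hedged sketch, `pd.get_dummies` can also drop a column for you with `drop_first=True`. Note that it removes the first category in sorted order (`Audi A5` here), not `Mercedez Benz C class` as in the cell above; either choice avoids the trap. The small `cars` frame below is illustrative only.

```python
import pandas as pd

# Illustrative rows only, using the category labels from the 'Car Model' column.
cars = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"],
})

# Full encoding: one column per category, so every row sums to exactly 1.
full = pd.get_dummies(cars["Car Model"])
print(full.sum(axis="columns").tolist())

# drop_first=True removes the first category in sorted order ('Audi A5' here).
reduced = pd.get_dummies(cars["Car Model"], drop_first=True)
print(list(reduced.columns))
```

With `drop_first=True`, a row where all remaining dummy columns are 0 stands for the dropped category.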


### Now prepare X and Y to train the model
##### Our model will predict price on the basis of mileage, age and car type,
##### so X will contain mileage, age and car type
##### and Y will contain price

In [11]:
X = Final_df.drop(['Sell Price($)'],axis='columns')
X

Unnamed: 0,Mileage,Age(yrs),Audi A5,BMW X5
0,69000,6,False,True
1,35000,3,False,True
2,57000,5,False,True
3,22500,2,False,True
4,46000,4,False,True
5,59000,5,True,False
6,52000,5,True,False
7,72000,6,True,False
8,91000,8,True,False
9,67000,6,False,False


In [12]:
Y = Final_df['Sell Price($)']
Y

0     18000
1     34000
2     26100
3     40000
4     31500
5     29400
6     32000
7     19300
8     12000
9     22000
10    20000
11    21000
12    33000
Name: Sell Price($), dtype: int64

In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

#### Now train the model

In [14]:
model.fit(X,Y)

In [20]:
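# Predict price of a Mercedez Benz that is 4 yr old with mileage 45000
# Feature order: Mileage, Age(yrs), Audi A5, BMW X5
# Mercedez Benz is the dropped dummy column, so both dummy values are 0
model.predict([[45000,4,0,0]])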



array([36991.31721061])

In [21]:
# Predict price of a BMW X5 that is 7 yr old with mileage 86000
# Feature order: Mileage, Age(yrs), Audi A5, BMW X5 -> Audi A5 = 0, BMW X5 = 1
model.predict([[86000,7,0,1]])



array([11080.74313219])
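Passing a bare nested list works, but it relies on remembering the column order, and recent scikit-learn versions warn when a model fitted on a DataFrame is asked to predict from unnamed data. A possible alternative, sketched below, is to build the query as a DataFrame with the same columns. The names `train`, `prices`, `demo_model` and `query` are made up for this sketch, and the training rows are only the ten rows displayed in the tables above, so the numbers it prints will not match the notebook's outputs.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows displayed above, for illustration only;
# the notebook's own model is fitted on the data read from carprices.csv.
train = pd.DataFrame({
    "Mileage":  [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
    "Audi A5":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    "BMW X5":   [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
prices = pd.Series(
    [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    name="Sell Price($)",
)

demo_model = LinearRegression().fit(train, prices)

# Build the query with the same column names and order as the training data.
query = pd.DataFrame(
    [[45000, 4, 0, 0],   # Mercedez Benz: both dummy columns are 0
     [86000, 7, 0, 1]],  # BMW X5
    columns=train.columns,
)
predictions = demo_model.predict(query)
print(predictions)
```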

In [25]:
# checking the model score (R^2 on the training data, not classification accuracy)
model.score(X,Y)

0.9417050937281082
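For a regressor, `score` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand to make the number easier to interpret. The helper name `r_squared` and the sample values are hypothetical and unrelated to the car data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, chosen only to illustrate the formula.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 20.0, 28.0]
score = r_squared(actual, predicted)
print(score)
```

A value of 1.0 means the predictions match the targets exactly. Because the score above is computed on the same rows the model was trained on, it likely overstates how well the model would do on unseen cars.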