### Encoding categorical variables using the dummy variables method

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,Kolkata,2600,550000
1,Kolkata,3000,565000
2,Kolkata,3200,610000
3,Kolkata,3600,680000
4,Kolkata,4000,725000
5,Hyderabad,2600,585000
6,Hyderabad,2800,615000
7,Hyderabad,3300,650000
8,Hyderabad,3600,710000
9,Bangalore,2600,575000


### Convert the categorical 'town' column into numeric columns using the get_dummies() method

In [3]:
#create dummy variables
dummies = pd.get_dummies(df.town)
dummies

Unnamed: 0,Bangalore,Hyderabad,Kolkata
0,False,False,True
1,False,False,True
2,False,False,True
3,False,False,True
4,False,False,True
5,False,True,False
6,False,True,False
7,False,True,False
8,False,True,False
9,True,False,False


Each row represents a house from the dataset. The columns Bangalore, Hyderabad, and Kolkata are the dummy variables.
'True' in one of these columns indicates that the house is located in that town, while 'False' means it is not. For example, the first row (index 0) has 'True' in the Kolkata column, indicating that this house is in Kolkata.

### Convert the boolean values to 0/1 integers

In [4]:
dummies = dummies.astype(int)
dummies

Unnamed: 0,Bangalore,Hyderabad,Kolkata
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
5,0,1,0
6,0,1,0
7,0,1,0
8,0,1,0
9,1,0,0


### To avoid the dummy variable trap, drop one dummy column (here 'Hyderabad')


In [5]:
dummies = dummies.drop(['Hyderabad'], axis='columns')
dummies

Unnamed: 0,Bangalore,Kolkata
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
5,0,0
6,0,0
7,0,0
8,0,0
9,1,0


#### Now the dummies DataFrame contains only two dummy variables: Bangalore and Kolkata.
#### A house in Hyderabad is implicitly indicated by 0 in both of these columns. Dropping one column avoids the dummy variable trap: with all three columns present, each row sums to 1, which duplicates the intercept term and causes perfect multicollinearity in the model.
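
The trap can be checked numerically. The following is a minimal sketch on a hypothetical six-row `towns` series (not the full dataset): with all three dummy columns plus an intercept column the design matrix has four columns but only rank 3, while dropping 'Hyderabad' leaves three columns of rank 3.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample: two houses per town
towns = pd.Series(
    ["Kolkata", "Kolkata", "Hyderabad", "Hyderabad", "Bangalore", "Bangalore"],
    name="town",
)

all_dummies = pd.get_dummies(towns, dtype=int)   # columns: Bangalore, Hyderabad, Kolkata
row_sums = all_dummies.sum(axis="columns")       # every row sums to 1

# Design matrix = intercept column + dummy columns
intercept = np.ones((len(towns), 1))
full = np.hstack([intercept, all_dummies.to_numpy()])
reduced = np.hstack([intercept, all_dummies.drop(columns="Hyderabad").to_numpy()])

rank_full = np.linalg.matrix_rank(full)          # 4 columns, linearly dependent
rank_reduced = np.linalg.matrix_rank(reduced)    # 3 columns, linearly independent
print(row_sums.tolist(), full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```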

In [6]:
merged = pd.concat([df,dummies], axis='columns')
merged

Unnamed: 0,town,area,price,Bangalore,Kolkata
0,Kolkata,2600,550000,0,1
1,Kolkata,3000,565000,0,1
2,Kolkata,3200,610000,0,1
3,Kolkata,3600,680000,0,1
4,Kolkata,4000,725000,0,1
5,Hyderabad,2600,585000,0,0
6,Hyderabad,2800,615000,0,0
7,Hyderabad,3300,650000,0,0
8,Hyderabad,3600,710000,0,0
9,Bangalore,2600,575000,1,0


In this merged DataFrame, the columns Bangalore and Kolkata are the dummy variables representing the towns. A value of 1 indicates the house is in that town, and 0 indicates it is not. Note that Hyderabad is represented implicitly (when both Bangalore and Kolkata are 0).
The town names are now encoded numerically. Once the original text column 'town' is removed, the DataFrame can be used in machine learning models.

### The 'town' column is now redundant because the dummy variables replace it, so we drop it

In [7]:
final = merged.drop(['town'], axis='columns')
final

Unnamed: 0,area,price,Bangalore,Kolkata
0,2600,550000,0,1
1,3000,565000,0,1
2,3200,610000,0,1
3,3600,680000,0,1
4,4000,725000,0,1
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,1,0
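
The encode, convert, drop, merge, and drop-'town' steps above can also be collapsed into a single `get_dummies` call. This is only a sketch on five rows copied by hand from the table above. Note two differences from the manual approach: `drop_first=True` drops the alphabetically first category (Bangalore here, not Hyderabad), and the new columns get a `town_` prefix.

```python
import pandas as pd

# Five rows copied from the table above, typed in by hand for illustration
df = pd.DataFrame({
    "town": ["Kolkata", "Kolkata", "Hyderabad", "Hyderabad", "Bangalore"],
    "area": [2600, 3000, 2600, 2800, 2600],
    "price": [550000, 565000, 585000, 615000, 575000],
})

# Encode 'town', drop the first category, and return 0/1 integers in one call
encoded = pd.get_dummies(df, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

With this variant Bangalore becomes the implicit baseline, so the coefficients would be read relative to Bangalore rather than Hyderabad.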


### 'price' is the target column to be predicted, so we drop it from the feature matrix x and keep it separately as y

In [8]:
x = final.drop(['price'],axis='columns')
x

Unnamed: 0,area,Bangalore,Kolkata
0,2600,0,1
1,3000,0,1
2,3200,0,1
3,3600,0,1
4,4000,0,1
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,1,0


In [9]:
y = final['price']
y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

### Internal Handling in Linear Regression Models:
Some linear regression implementations tolerate perfectly collinear dummy variables. R's lm(), for example, detects the redundant column and reports its coefficient as NA. scikit-learn's LinearRegression does not remove a column; it relies on a least-squares solver that typically still returns a solution when the columns are linearly dependent.
This means that even if you include all dummy variables (one for each category of the categorical variable), fitting usually succeeds and the predictions match those of a model with one dummy dropped. The individual coefficients, however, are no longer uniquely determined, which makes them harder to interpret.

### Implication:
With scikit-learn you can usually pass all dummy variables to the model without manually omitting one. The fit will not fail, although no column is actually removed for you.
However, it is still good practice to drop one dummy variable yourself: it makes the baseline category explicit, keeps the coefficients interpretable, and avoids depending on how a particular library handles multicollinearity.
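
As a rough check of this behaviour in scikit-learn, the sketch below fits one model with all three dummies and one with 'Hyderabad' dropped, using only the ten rows visible in the tables above (so the numbers will differ from the notebook outputs). The names `x_full`, `x_dropped`, `model_full`, and `model_dropped` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows visible in the tables above
df = pd.DataFrame({
    "town": ["Kolkata"] * 5 + ["Hyderabad"] * 4 + ["Bangalore"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})
y = df["price"]
dummies = pd.get_dummies(df["town"], dtype=int)

x_full = pd.concat([df[["area"]], dummies], axis="columns")  # all three dummies
x_dropped = x_full.drop(columns="Hyderabad")                  # one dummy removed

model_full = LinearRegression().fit(x_full, y)
model_dropped = LinearRegression().fit(x_dropped, y)

# Compare fitted values from the two specifications
pred_full = model_full.predict(x_full)
pred_dropped = model_dropped.predict(x_dropped)
same_predictions = np.allclose(pred_full, pred_dropped)
print(same_predictions)

# The coefficients differ even though the predictions agree
print(dict(zip(x_full.columns, model_full.coef_.round(2))))
print(dict(zip(x_dropped.columns, model_dropped.coef_.round(2))))
```

The predictions agree because both feature sets span the same space once the intercept is included; only the way the town effect is split across coefficients changes.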

In [14]:
model = LinearRegression()
#Train the model
model.fit(x,y)

### Predict the price of a house with 2800 sq ft area located in Kolkata
#### Parameters: [area, Bangalore, Kolkata], in the same order as the columns of x

In [11]:
model.predict([[2800,0,1]])

array([565089.22812299])

### Predict the price of a house with 3400 sq ft area in Hyderabad (both dummy variables set to 0)

In [12]:
model.predict([[3400,0,0]])

array([681241.66845839])

### Measuring how well the model fits: R² score on the training data

In [13]:
model.score(x,y)

0.9573929037221873
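
For a regressor, `score` returns the coefficient of determination R², not a classification accuracy, and here it is computed on the same rows the model was trained on. The sketch below, again limited to the ten rows visible in the tables above, shows that it matches `sklearn.metrics.r2_score` applied to the model's own predictions. The names `train_r2` and `manual_r2` are introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The ten rows visible in the tables above, already encoded
x = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "Bangalore": [0] * 9 + [1],
    "Kolkata": [1] * 5 + [0] * 5,
})
y = pd.Series([550000, 565000, 610000, 680000, 725000,
               585000, 615000, 650000, 710000, 575000], name="price")
model = LinearRegression().fit(x, y)

train_r2 = model.score(x, y)                  # R^2 on the training rows
manual_r2 = r2_score(y, model.predict(x))     # same quantity, computed explicitly
print(train_r2, manual_r2)
```

Because the score is measured on the training data, it says little about performance on unseen houses; a held-out test set would be needed for that.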