## Problem Statement:
A cloth manufacturing company wants to know which segments or attributes drive high sales. Approach: build a Random Forest with Sales as the target variable (first converted to a categorical variable) and all other variables as independent predictors.

## About the data:
Let’s consider a Company dataset with 11 variables and 400 records. The attributes are as follows:

*  Sales -- Unit sales (in thousands) at each location
*  Competitor Price -- Price charged by competitor at each location
*  Income -- Community income level (in thousands of dollars)
*  Advertising -- Local advertising budget for company at each location (in thousands of dollars)
*  Population -- Population size in region (in thousands)
*  Price -- Price company charges for car seats at each site
*  Shelf Location at stores -- A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
*  Age -- Average age of the local population
*  Education -- Education level at each location
*  Urban -- A factor with levels No and Yes to indicate whether the store is in an urban or rural location
*  US -- A factor with levels No and Yes to indicate whether the store is in the US or not

# 1. Import Necessary Libraries

In [66]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix

# 2. Import Data

In [5]:
company_data=pd.read_csv('Company_Data.csv')
company_data

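| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.50 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 3 | 7.40 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 12.57 | 138 | 108 | 17 | 203 | 128 | Good | 33 | 14 | Yes | Yes |
| 396 | 6.14 | 139 | 23 | 3 | 37 | 120 | Medium | 55 | 11 | No | Yes |
| 397 | 7.41 | 162 | 26 | 12 | 368 | 159 | Medium | 40 | 18 | Yes | Yes |
| 398 | 5.94 | 100 | 79 | 7 | 284 | 95 | Bad | 50 | 12 | Yes | Yes |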


# 3. Data Understanding

In [6]:
company_data.shape

(400, 11)

In [31]:
company_data.isnull().sum()

Sales          0
CompPrice      0
Income         0
Advertising    0
Population     0
Price          0
ShelveLoc      0
Age            0
Education      0
Urban          0
US             0
sales          0
dtype: int64

In [32]:
company_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    int32  
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    int32  
 10  US           400 non-null    int32  
 11  sales        400 non-null    object 
dtypes: float64(1), int32(3), int64(7), object(1)
memory usage: 32.9+ KB


In [33]:
company_data.describe()

| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 7.496325 | 124.975 | 68.6575 | 6.635 | 264.84 | 115.795 | 1.3075 | 53.3225 | 13.9 | 0.705 | 0.645 |
| std | 2.824115 | 15.334512 | 27.986037 | 6.650364 | 147.376436 | 23.676664 | 0.833475 | 16.200297 | 2.620528 | 0.456614 | 0.479113 |
| min | 0.0 | 77.0 | 21.0 | 0.0 | 10.0 | 24.0 | 0.0 | 25.0 | 10.0 | 0.0 | 0.0 |
| 25% | 5.39 | 115.0 | 42.75 | 0.0 | 139.0 | 100.0 | 1.0 | 39.75 | 12.0 | 0.0 | 0.0 |
| 50% | 7.49 | 125.0 | 69.0 | 5.0 | 272.0 | 117.0 | 2.0 | 54.5 | 14.0 | 1.0 | 1.0 |
| 75% | 9.32 | 135.0 | 91.0 | 12.0 | 398.5 | 131.0 | 2.0 | 66.0 | 16.0 | 1.0 | 1.0 |
| max | 16.27 | 175.0 | 120.0 | 29.0 | 509.0 | 191.0 | 2.0 | 80.0 | 18.0 | 1.0 | 1.0 |


In [34]:
# Apply label encoding to the categorical columns
labelencoder=LabelEncoder()
company_data['ShelveLoc']= labelencoder.fit_transform(company_data['ShelveLoc'])
company_data['Urban']= labelencoder.fit_transform(company_data['Urban'])
company_data['US']= labelencoder.fit_transform(company_data['US'])
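`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, so for a column such as ShelveLoc the codes do not follow the Bad < Medium < Good quality ranking. The sketch below uses a small made-up column (not rows from the dataset) to show the resulting mapping, and an explicit dictionary as one possible alternative; the names `shelf` and `quality_order` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for ShelveLoc (illustrative values, not dataset rows)
shelf = pd.Series(["Bad", "Good", "Medium", "Good", "Bad"])

encoder = LabelEncoder()
codes = encoder.fit_transform(shelf)

# classes_ is sorted alphabetically, so the integer codes follow that order
label_mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(label_mapping)

# Alternative (an assumption, not the notebook's method):
# an explicit mapping that preserves the quality ranking
quality_order = {"Bad": 0, "Medium": 1, "Good": 2}
ordinal_codes = shelf.map(quality_order)
print(ordinal_codes.tolist())
```

Tree-based models split on thresholds, so either encoding can work, but the ordered mapping lets a single split separate Good from the other two levels.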

In [35]:
company_data

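| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.50 | 138 | 73 | 11 | 276 | 120 | 0 | 42 | 17 | 1 | 1 | High |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | 1 | 65 | 10 | 1 | 1 | High |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | 2 | 59 | 12 | 1 | 1 | High |
| 3 | 7.40 | 117 | 100 | 4 | 466 | 97 | 2 | 55 | 14 | 1 | 1 | Low |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | 0 | 38 | 13 | 1 | 0 | Low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 12.57 | 138 | 108 | 17 | 203 | 128 | 1 | 33 | 14 | 1 | 1 | High |
| 396 | 6.14 | 139 | 23 | 3 | 37 | 120 | 2 | 55 | 11 | 0 | 1 | Low |
| 397 | 7.41 | 162 | 26 | 12 | 368 | 159 | 2 | 40 | 18 | 1 | 1 | Low |
| 398 | 5.94 | 100 | 79 | 7 | 284 | 95 | 0 | 50 | 12 | 1 | 1 | Low |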


In [36]:
#Average Sales
company_data['Sales'].mean()

7.496325

>>Because the company wants to know which segments or attributes drive high sales, the records are split into high-sales and low-sales groups at a threshold of 7.49, the median of Sales (the mean is 7.496).

In [37]:
company_data.loc[company_data['Sales']>=7.49, 'sales']='High'

In [38]:
company_data.loc[company_data['Sales']<7.49,'sales']='Low'
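The two `.loc` assignments above can also be written as a single vectorised step. The following is a minimal sketch on a toy frame (the five Sales values are picked for illustration, and `toy` and `threshold` are assumed names), using the same 7.49 cut-off.

```python
import numpy as np
import pandas as pd

# Toy frame with a handful of Sales values (illustrative only)
toy = pd.DataFrame({"Sales": [9.50, 11.22, 7.40, 4.15, 7.49]})

# Same cut-off as the notebook: values at or above it count as High
threshold = 7.49
toy["sales"] = np.where(toy["Sales"] >= threshold, "High", "Low")
print(toy)
```

A value exactly equal to the threshold lands in the High group, matching the `>=` comparison used in the notebook.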

In [39]:
company_data

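| | Sales | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US | sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.50 | 138 | 73 | 11 | 276 | 120 | 0 | 42 | 17 | 1 | 1 | High |
| 1 | 11.22 | 111 | 48 | 16 | 260 | 83 | 1 | 65 | 10 | 1 | 1 | High |
| 2 | 10.06 | 113 | 35 | 10 | 269 | 80 | 2 | 59 | 12 | 1 | 1 | High |
| 3 | 7.40 | 117 | 100 | 4 | 466 | 97 | 2 | 55 | 14 | 1 | 1 | Low |
| 4 | 4.15 | 141 | 64 | 3 | 340 | 128 | 0 | 38 | 13 | 1 | 0 | Low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 12.57 | 138 | 108 | 17 | 203 | 128 | 1 | 33 | 14 | 1 | 1 | High |
| 396 | 6.14 | 139 | 23 | 3 | 37 | 120 | 2 | 55 | 11 | 0 | 1 | Low |
| 397 | 7.41 | 162 | 26 | 12 | 368 | 159 | 2 | 40 | 18 | 1 | 1 | Low |
| 398 | 5.94 | 100 | 79 | 7 | 284 | 95 | 0 | 50 | 12 | 1 | 1 | Low |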


## Splitting the dataset into Training and Testing parts

In [40]:
x= company_data.iloc[:,1:-1]
x

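| | CompPrice | Income | Advertising | Population | Price | ShelveLoc | Age | Education | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 138 | 73 | 11 | 276 | 120 | 0 | 42 | 17 | 1 | 1 |
| 1 | 111 | 48 | 16 | 260 | 83 | 1 | 65 | 10 | 1 | 1 |
| 2 | 113 | 35 | 10 | 269 | 80 | 2 | 59 | 12 | 1 | 1 |
| 3 | 117 | 100 | 4 | 466 | 97 | 2 | 55 | 14 | 1 | 1 |
| 4 | 141 | 64 | 3 | 340 | 128 | 0 | 38 | 13 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 138 | 108 | 17 | 203 | 128 | 1 | 33 | 14 | 1 | 1 |
| 396 | 139 | 23 | 3 | 37 | 120 | 2 | 55 | 11 | 0 | 1 |
| 397 | 162 | 26 | 12 | 368 | 159 | 2 | 40 | 18 | 1 | 1 |
| 398 | 100 | 79 | 7 | 284 | 95 | 0 | 50 | 12 | 1 | 1 |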


In [41]:
y=company_data['sales']
y

0      High
1      High
2      High
3       Low
4       Low
       ... 
395    High
396     Low
397     Low
398     Low
399    High
Name: sales, Length: 400, dtype: object

In [42]:
y.value_counts()

High    201
Low     199
Name: sales, dtype: int64

In [43]:
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8,random_state=0)
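The split above is purely random, so the High/Low proportions in the test set can drift from those in the full data. One option is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The sketch below runs on synthetic stand-in data; `X_demo`, `y_demo`, and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 numeric features, balanced labels
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(["High"] * 50 + ["Low"] * 50)

# stratify=y_demo keeps the High/Low ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0, stratify=y_demo
)
print(y_tr.value_counts())
print(y_te.value_counts())
```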

In [44]:
x_train.shape,y_train.shape

((320, 10), (320,))

In [45]:
x_test.shape,y_test.shape

((80, 10), (80,))

## Model Building

In [78]:
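from sklearn.model_selection import KFold, cross_val_score

# Cross-validate the random forest on the full dataset
kfold = KFold(n_splits=100,shuffle=True)
model = RandomForestClassifier(n_estimators=50, max_features=4)
results_rfc = cross_val_score(model, x, y, cv=kfold)
# Mean accuracy across folds, as a percentage
print(results_rfc.mean()*100)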

80.25
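With 400 rows, `n_splits=100` leaves only 4 rows in each test fold, so every per-fold accuracy can only be 0, 0.25, 0.5, 0.75, or 1. The sketch below counts the test-fold sizes for 100 and for 10 splits on a placeholder array of the same length; `placeholder` and `fold_sizes` are assumed names, and no model is fitted.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix with the same number of rows as the dataset
placeholder = np.zeros((400, 1))

fold_sizes = {}
for n_splits in (100, 10):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Collect the distinct test-fold sizes produced by this splitter
    fold_sizes[n_splits] = {len(test_idx) for _, test_idx in kfold.split(placeholder)}

print(fold_sizes)
```

Fewer, larger folds (10 is a common choice) give per-fold scores that are less coarse, at the cost of slightly smaller training sets.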


In [60]:
model.fit(x_train,y_train)

RandomForestClassifier(max_features=4, n_estimators=50)
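The problem statement asks which attributes drive high sales, and a fitted random forest exposes this through its `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data from `make_classification`, with generic feature names; the ranking it prints says nothing about the Company dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with generic feature names (assumed, for illustration only)
feature_names = [f"feature_{i}" for i in range(10)]
X_arr, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, max_features=4, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort to rank the features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects, the equivalent call would be `pd.Series(model.feature_importances_, index=x.columns)` after `model.fit(x_train, y_train)`.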

In [84]:
# Predict on the training and test sets with the fitted random forest
y_train_pred=model.predict(x_train)
y_test_pred= model.predict(x_test)

In [85]:
from sklearn.metrics import accuracy_score

In [86]:
accuracy_score(y_train,y_train_pred)

1.0

In [87]:
accuracy_score(y_test,y_test_pred)

0.825

In [88]:
result1 = classification_report(y_test,y_test_pred)
print('Classification Report:')
print(result1)
result2= confusion_matrix(y_test,y_test_pred)
print('Confusion Matrix:')
print(result2)

Classification Report:
              precision    recall  f1-score   support

        High       0.88      0.80      0.84        45
         Low       0.77      0.86      0.81        35

    accuracy                           0.82        80
   macro avg       0.82      0.83      0.82        80
weighted avg       0.83      0.82      0.83        80

Confusion Matrix:
[[36  9]
 [ 5 30]]
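The figures in the classification report can be recomputed by hand from the confusion matrix, where rows are true labels and columns are predicted labels, both ordered High then Low. A short worked check, with variable names chosen for illustration:

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted,
# label order [High, Low]
cm = np.array([[36, 9],
               [5, 30]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over all predictions of that class
precision_high = cm[0, 0] / cm[:, 0].sum()
precision_low = cm[1, 1] / cm[:, 1].sum()

# Recall: correct predictions of a class over all true members of that class
recall_high = cm[0, 0] / cm[0, :].sum()
recall_low = cm[1, 1] / cm[1, :].sum()

print(f"accuracy={accuracy:.3f}")
print(f"High: precision={precision_high:.2f}, recall={recall_high:.2f}")
print(f"Low:  precision={precision_low:.2f}, recall={recall_low:.2f}")
```

So 9 of the 45 truly High stores were predicted Low, and 5 of the 35 truly Low stores were predicted High.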


### Inference:
   >Training accuracy is 100%, but test accuracy is only 82.5%, a gap of 17.5 percentage points.

## BIAS - VARIANCE
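* Training error reflects bias: a model that cannot fit even its training data has high bias.
* The gap between training and test error reflects variance: a model that fits the training data far better than unseen data has high variance.

**Model Overfitting** - Low Bias and High Variance.

**Model Underfitting** - High Bias and Low Variance.

**EXPECTED MODEL -- GENERALIZED MODEL** - Low Bias and Low Variance.

So there is always a trade-off to maintain between bias and variance.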

### Here my model is overfitting
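Fully grown trees usually memorise their training data, so a random forest scoring 100% on the training set is common. One way to narrow the train/test gap is to constrain tree growth. The sketch below compares an unconstrained forest with a constrained one on synthetic data; the values `max_depth=4` and `min_samples_leaf=5` are illustrative assumptions, not tuned settings, and the result on the Company dataset may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's x and y
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

# Hypothetical settings: defaults versus shallower trees with larger leaves
settings = {
    "unconstrained": {},
    "constrained": {"max_depth": 4, "min_samples_leaf": 5},
}

gaps = {}
for name, params in settings.items():
    forest = RandomForestClassifier(n_estimators=50, random_state=0, **params)
    forest.fit(X_tr, y_tr)
    train_acc = forest.score(X_tr, y_tr)
    test_acc = forest.score(X_te, y_te)
    # A smaller train/test gap suggests lower variance
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```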

In [89]:
from sklearn.ensemble import AdaBoostClassifier

In [107]:
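# AdaBoost classifier, cross-validated on the full dataset
kfold = KFold(n_splits=100)
model1 = AdaBoostClassifier(n_estimators=100)
# Score model1 (AdaBoost); the 0.7925 recorded below came from a run that scored `model`, the random forest
results = cross_val_score(model1, x, y, cv=kfold)
print(results.mean())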

0.7925


In [108]:
model1.fit(x_train,y_train)

AdaBoostClassifier(n_estimators=100)

In [109]:
y_pred=model1.predict(x_test)

In [110]:
accuracy_score(y_pred,y_test)

0.85
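A single 80-row test split is a noisy basis for ranking two models, so a fairer comparison scores both on the same cross-validation folds. The sketch below does this on synthetic data; `candidates`, `cv_means`, and the choice of 10 stratified folds are assumptions for illustration, and no result for the Company dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's x and y
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Fixed random_state so both models see exactly the same folds
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_features=4, random_state=0
    ),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```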

In [111]:
# Test accuracy rose from 82.5% (random forest) to 85% (AdaBoost) on this 80-row split, a difference of 2 correct predictions

# THE END!