# 1. <a id='Introduction'>Introduction 

![foto](https://theyellowcarcompany.com/wp-content/uploads/2019/04/The_Yellow_Car_Company_Sales_TYCC-930x550.jpg)

### The "Individual Company Sales" dataset is an interesting example of how we can use a variety of customer information to predict whether a customer will buy a specific product. The product in question is generic, so our analysis can in theory be applied to any product.

### This dataset includes about 40,000 rows and 15 variables. Each row holds the information for one customer and includes the following variables:

### 1. flag: Whether the customer has bought the target product or not

### 2. gender: Gender of the customer

### 3. education: Education background of customer

### 4. house_val: Value of the residence the customer lives in

### 5. age: Age of the customer by group

### 6. online: Whether the customer had online shopping experience or not

### 7. customer_psy: Variable describing consumer psychology based on the area of residence

### 8. marriage: Marriage status of the customer

### 9. child: Whether the customer has children or not

### 10. occupation: Career information of the customer

### 11. mortgage: Housing Loan Information of customers

### 12. house_owner: Whether the customer owns a house or not

### 13. region: Information on the area in which the customer is located

### 14. car_prob: The probability that the customer will buy a new car (1 means the maximum possible)

### 15. fam_income: Family income information of the customer (A means the lowest, and L means the highest)

# 2. <a id='importing'>Importing the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Disable warnings
import warnings
warnings.filterwarnings("ignore")

# Import plotting modules
!pip install chart-studio
import seaborn as sns
sns.set()
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker
import plotly.express as px
from plotly.offline import iplot
from matplotlib import rcParams

import chart_studio.plotly as py
import plotly.graph_objs as go
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
%matplotlib inline

import plotly.figure_factory as ff
from colorama import Fore, Back, Style 

# Import encoder library
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder 

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

# 3. <a id='reading'>Reading the dataset.csv

In [None]:
# load data
df = pd.read_csv('../input/individual-company-sales-data/sales_data.csv')
df.head()

In [None]:
print(Fore.BLUE + 'Data information ....................',Style.RESET_ALL)
print(df.info())

In [None]:
for i in df.columns:
    print(i, df[i].unique())

# 4. <a id='basic'>Basic Data Exploration

In [None]:
# Treat the "unknown" codes as missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
df['gender'] = df.gender.replace('U', np.nan)
df['age'] = df.age.replace('1_Unk', np.nan)
df['child'] = df.child.replace('U', np.nan)
df['child'] = df.child.replace('0', np.nan)

In [None]:
# Percentage of missing values per column
df.isnull().sum() / df.shape[0] * 100

In [None]:
# Show the outliers in 'house_val'
plt.figure(figsize = (12, 8))
sns.boxplot(data= df, x = 'house_val')
plt.show()

### Using the interquartile range (IQR) method to eliminate outliers
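As a minimal, self-contained sketch of the rule applied to `house_val` in this section, the snippet below computes the lower and upper fences, `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, and keeps only the values strictly between them. The helper name `iqr_bounds` and the toy values are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical house values; the last one is an obvious outlier.
toy = pd.Series([100, 120, 130, 150, 160, 180, 200, 5000])

lower, upper = iqr_bounds(toy)
kept = toy[(toy > lower) & (toy < upper)]

print(f"fences: ({lower}, {upper}), rows kept: {len(kept)} of {len(toy)}")
```

Note that a comparison with a missing value is always false, so a filter of this form also drops rows where the value is NaN.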

In [None]:
# Applying the quantile method
hi_q1 = df['house_val'].quantile(.25)
hi_q3 =df['house_val'].quantile(.75)
iqr = hi_q3 - hi_q1

In [None]:
hi_up = hi_q3 + 1.5*iqr
hi_down = hi_q1 - 1.5*iqr

In [None]:
df0 = df[(df['house_val']> hi_down) & (df['house_val'] < hi_up)]

In [None]:
# Show 'house_val' without outliers
plt.figure(figsize=(12,8))
sns.boxplot(data= df0, x = 'house_val')
plt.show()

In [None]:
# Work on an explicit copy so later column assignments do not act on a slice of df
dff = df0.copy()

In [None]:
# Pie plot of house owner
plt.figure(figsize =(7, 7))
df['house_owner'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')


In [None]:
# Percentage of null values in house owner
(df.isnull().sum() / df.shape[0] * 100)['house_owner']

### Most customers are house owners, so we can fill the missing values in 'house_owner' with the most frequent category (the mode)
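A minimal sketch of this kind of mode imputation on a toy Series is shown here. The labels are hypothetical stand-ins, not values taken from the dataset. Filling with the mode inflates the dominant category, so it is most defensible when that category clearly dominates and the share of missing values is small.

```python
import numpy as np
import pandas as pd

# Hypothetical ownership labels with two missing entries.
owner = pd.Series(["Owner", "Owner", "Renter", np.nan, "Owner", np.nan])

# Share of each category among the non-missing values.
share = owner.value_counts(normalize=True)

# The mode is the most frequent non-missing value.
most_frequent = owner.mode()[0]
filled = owner.fillna(most_frequent)

print(share)
print(filled.value_counts())
```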

In [None]:
dff['house_owner'] = dff['house_owner'].fillna(df.mode()['house_owner'][0])

In [None]:
# Pie plot of age
plt.figure(figsize =(7, 7))
df['age'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

In [None]:
# Percentage of null values in age
(df.isnull().sum() / df.shape[0] * 100)['age']

In [None]:
dff = dff.dropna(subset=['age'])

In [None]:
# Pie plot of child
plt.figure(figsize =(7, 7))
df['child'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

In [None]:
# Percentage of null values in child
(df.isnull().sum() / df.shape[0] * 100)['child']

### There is no dominant category in 'child', so we cannot reasonably fill its missing values with the mode. It is therefore reasonable to drop the 'child' column.

In [None]:
dff = dff.drop('child', axis=1)

In [None]:
# Pie plot marriage
plt.figure(figsize =(7, 7))
df['marriage'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

In [None]:
# Percentage of null values in marriage
(df.isnull().sum() / df.shape[0] * 100)['marriage']

### More than 80% of the customers are married, so we can fill the missing values in 'marriage' with the most frequent category

In [None]:
dff['marriage'] = dff['marriage'].fillna(dff.mode()['marriage'][0])

In [None]:
# Pie plot gender
plt.figure(figsize =(7, 7))
df['gender'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

In [None]:
# Percentage of null values in gender
(df.isnull().sum() / df.shape[0] * 100)['gender']

In [None]:
# Pie plot education
plt.figure(figsize =(7, 7))
df['education'].value_counts().head(10).plot.pie(autopct='%1.1f%%')

# Unsquish the pie.
import matplotlib.pyplot as plt
plt.gca().set_aspect('equal')

In [None]:
# Percentage of null values in education
(df.isnull().sum() / df.shape[0] * 100)['education']

### Since only a small share of values is missing in the 'education' and 'gender' columns, we simply drop the rows that contain them.

In [None]:
dff = dff.dropna(subset=['gender', 'education'])

In [None]:
# checking data cleaning
dff.isnull().sum()

### No more missing values

# 5. <a id='details'>Feature Engineering of dataset columns

### We start by converting the columns of object type into numeric values

In [None]:
dff.dtypes

### First, we convert the ordinal (hierarchical) columns
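As a small illustration of ordinal encoding, the sketch below maps letter-coded income bands to integer ranks in two ways. The toy values are hypothetical. The first mapping is built from the categories actually present, in sorted order; the second is a fixed mapping over the full range A to L, which is an assumption based on the variable description in the introduction.

```python
import string

import pandas as pd

# Hypothetical income bands ('A' is the lowest, 'L' the highest).
income = pd.Series(["C", "A", "L", "B", "A"])

# Rank built from the categories that appear in the data.
rank_from_data = {label: i + 1 for i, label in enumerate(sorted(income.unique()))}
encoded = income.map(rank_from_data)

# Fixed rank over all twelve bands A..L, independent of which bands appear.
rank_fixed = {label: i + 1 for i, label in enumerate(string.ascii_uppercase[:12])}
encoded_fixed = income.map(rank_fixed)

print(encoded.tolist())
print(encoded_fixed.tolist())
```

The two encodings agree only when every band occurs in the data. If a band is absent, the data-driven ranks shift down, which matters when the same mapping has to be reused on a different sample.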

In [None]:
# Converting flag and online features to binary integer
dff['flag'] = dff['flag'].apply(lambda value: 1 if value == 'Y' else 0)
dff['online'] = dff['online'].apply(lambda value: 1 if value == 'Y' else 0)

In [None]:
# Converting education to integer
dff['education'] = dff['education'].apply(lambda value: int(value[0]) + 1 )

In [None]:
# Converting age to integer
dff['age'] = dff['age'].apply(lambda value: int(value[0]) - 1 )

In [None]:
# Converting mortgage to integer
dff['mortgage'] = dff['mortgage'].apply(lambda value: int(value[0]))

In [None]:
#fam_income label dictionary
dict_fam_income_label = {}
for i, char in enumerate(sorted(dff['fam_income'].unique().tolist())):
    dict_fam_income_label[char] = i + 1

In [None]:
dff['fam_income'] = dff['fam_income'].apply(lambda value: dict_fam_income_label[value])

### Now we encode the remaining categorical columns as dummy variables

In [None]:
dummy_features = ['gender', 'customer_psy', 'occupation', 'house_owner', 'region', 'marriage']

In [None]:
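def apply_dummy(data, col, drop_first=True):
    """Replace the categorical column `col` with its dummy (one-hot) columns."""
    dummies = pd.get_dummies(data[col], prefix=col, drop_first=drop_first)
    return pd.concat([data, dummies], axis=1).drop(col, axis=1)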

In [None]:
# Converting dummy features in numerical values
for i in dummy_features:
    dff = apply_dummy(dff, i)

In [None]:
dff.head()

In [None]:
dff.dtypes

### All columns now contain numerical values. Note that we have many more columns than before; this is the price we pay for dummy encoding.
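To make the growth in column count concrete, here is a minimal sketch on a toy frame. The region labels are hypothetical. With `drop_first=True`, a categorical column with k distinct values is replaced by k - 1 indicator columns, because the dropped category is implied when all the others are zero.

```python
import pandas as pd

# Toy frame with one categorical column (3 distinct values) and one numeric column.
toy = pd.DataFrame({
    "region": ["West", "South", "West", "Midwest"],
    "age": [3, 5, 2, 4],
})

dummies = pd.get_dummies(toy["region"], prefix="region", drop_first=True)
encoded = pd.concat([toy.drop("region", axis=1), dummies], axis=1)

print(list(encoded.columns))
print(f"columns before: {toy.shape[1]}, after: {encoded.shape[1]}")
```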

In [None]:
# Heatmap of correlation
plt.figure(figsize=(14,14))
sns.heatmap(dff.corr())
plt.show()

### Looking at the correlation heatmap, we can see that most variables exhibit low positive or negative correlation. Recall that a positive correlation means that when the value of one variable increases, the value of the other tends to increase as well. With a negative correlation, the value of one variable tends to decrease as the other increases, and vice versa.
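The two cases can be illustrated with a toy frame, sketched below with made-up numbers. `DataFrame.corr()` computes the Pearson coefficient by default, which ranges from -1 to 1 and measures only linear association.

```python
import pandas as pd

# Made-up columns: 'up' rises with 'x', 'down' falls as 'x' rises.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],
    "down": [10, 8, 6, 4, 2],
})

corr = toy.corr()  # Pearson correlation matrix
print(corr.round(2))
```

Here `x` and `up` are perfectly positively correlated and `x` and `down` perfectly negatively correlated. Values near zero, like most of those in the heatmap, indicate a weak linear relationship.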

# 6. <a id='details'> Using machine learning to predict customer purchases

In [None]:
#Splitting the dataset into features and target
y0 = dff["flag"]
x0 = dff.drop("flag", axis = 1)

In [None]:
#Splitting the data into test data and training data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x0, y0, test_size = 0.3)


In [None]:
accuracy_list = []

In [None]:
# Decision Tree Classifier

dt_clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=30, criterion='entropy')
dt_clf.fit(x_train, y_train)
dt_pred = dt_clf.predict(x_test)
dt_acc = dt_clf.score(x_test,y_test)
accuracy_list.append(100*dt_acc)

In [None]:
print(Fore.GREEN + "Accuracy of Decision Tree Classifier is : ", "{:.2f}%".format(100* dt_acc))

In [None]:
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (8, 8))
mat = confusion_matrix(y_test, dt_pred)
sns.heatmap(mat.T, square=True, annot=True,fmt="d", cbar = False)
plt.title("Decision Tree Classifier - Confusion Matrix")
plt.xticks(range(2), ["0","1"], fontsize=16)
plt.yticks(range(2), ["0","1"], fontsize=16)
plt.xlabel("true label")
plt.ylabel("predicted label");

In [None]:
# K Neighbors Classifier

kn_clf = KNeighborsClassifier(n_neighbors=6)
kn_clf.fit(x_train, y_train)
kn_pred = kn_clf.predict(x_test)
kn_acc = kn_clf.score(x_test,y_test)
accuracy_list.append(100*kn_acc)


In [None]:
print(Fore.GREEN + "Accuracy of K Neighbors Classifier is : ", "{:.2f}%".format(100* kn_acc))

In [None]:
# Confusion matrix of  K Neighbors Classifier
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (8, 8))
mat = confusion_matrix(y_test, kn_pred)
sns.heatmap(mat.T, square=True, annot=True,fmt="d", cbar = False)
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.title("K Neighbors Classifier - Confusion Matrix")
plt.xticks(range(2), ["0","1"], fontsize=16)
plt.yticks(range(2), ["0","1"], fontsize=16);

In [None]:
# RandomForestClassifier
r_clf = RandomForestClassifier(max_features=0.5, max_depth=15, random_state=1)
r_clf.fit(x_train, y_train)
r_pred = r_clf.predict(x_test)
r_acc = r_clf.score(x_test,y_test)
accuracy_list.append(100*r_acc)

In [None]:
print(Fore.GREEN + "Accuracy of Random Forest Classifier is : ", "{:.2f}%".format(100* r_acc))

In [None]:
# Confusion matrix of Random Forest Classifier 
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (8, 8))
mat = confusion_matrix(y_test, r_pred)
sns.heatmap(mat.T, square=True, annot=True,fmt="d", cbar = False)
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.title("Random Forest Classifier - Confusion Matrix")
plt.xticks(range(2), ["0","1"], fontsize=16)
plt.yticks(range(2), ["0","1"], fontsize=16);

In [None]:
# GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gradientboost_clf = GradientBoostingClassifier(max_depth=2, random_state=4)
gradientboost_clf.fit(x_train,y_train)
gradientboost_pred = gradientboost_clf.predict(x_test)
gradientboost_acc = gradientboost_clf.score(x_test,y_test)
accuracy_list.append(100*gradientboost_acc)

In [None]:
print(Fore.GREEN + "Accuracy of Gradient Boosting is : ", "{:.2f}%".format(100* gradientboost_acc))

In [None]:
# Confusion matrix of Gradient Boosting
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (8, 8))
mat = confusion_matrix(y_test, gradientboost_pred)
sns.heatmap(mat.T, square=True, annot=True,fmt="d", cbar = False)
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.title("Gradient Boosting - Confusion Matrix")
plt.xticks(range(2), ["0","1"], fontsize=16)
plt.yticks(range(2), ["0","1"], fontsize=16);

In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
gaussian_pred = gaussian.predict(x_test)
gaussian_acc = gaussian.score(x_test,y_test)
accuracy_list.append(100*gaussian_acc)

In [None]:
print(Fore.GREEN + "Accuracy of Gaussian Naive Bayes is : ", "{:.2f}%".format(100* gaussian_acc))

In [None]:
# Confusion matrix of GaussianNB
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (8, 8))
mat = confusion_matrix(y_test, gaussian_pred)
sns.heatmap(mat.T, square=True, annot=True,fmt="d", cbar = False)
plt.xlabel("true label")
plt.ylabel("predicted label")
plt.title("GaussianNB - Confusion Matrix")
plt.xticks(range(2), ["0","1"], fontsize=16)
plt.yticks(range(2), ["0","1"], fontsize=16);

In [None]:
model_list = ['DecisionTree', 'KNearestNeighbours', 'RandomForest', 'GradientBooster', 'GaussianNB']

In [None]:
plt.rcParams['figure.figsize']=20,8
sns.set_style('darkgrid')
ax = sns.barplot(x=model_list, y=accuracy_list, palette = "vlag", saturation =2.0)
plt.xlabel('Classifier Models', fontsize = 20 )
plt.ylabel('% of Accuracy', fontsize = 20)
plt.title('Accuracy of different Classifier Models', fontsize = 20)
plt.xticks(fontsize = 12, horizontalalignment = 'center', rotation = 8)
plt.yticks(fontsize = 12)
for i in ax.patches:
    width, height = i.get_width(), i.get_height()
    x, y = i.get_xy() 
    ax.annotate(f'{round(height,2)}%', (x + width/2, y + height*1.02), ha='center', fontsize = 'x-large')
plt.show()

### We used five machine learning algorithms to predict whether a customer is likely to buy a particular product, based on various information about them. The best-performing algorithms were GradientBooster and RandomForest, with accuracy around 69%. Because the target variable represents a generic product, the same predictive models can in principle be applied to any particular product we want to analyze. The positive and negative correlations between the variables are not very large, which likely limits the prediction accuracy of the models. We conclude that the accuracy achievable with these machine learning models depends on how strongly the variables are correlated.
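For reference, the comparison of the five classifiers can also be written as a single loop that keeps each model name paired with its score. This is only a sketch on synthetic data from `make_classification`, an assumption made so that the snippet runs on its own, so the printed accuracies say nothing about the sales dataset. The hyperparameters mirror the ones used in section 6.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "DecisionTree": DecisionTreeClassifier(
        max_leaf_nodes=10, random_state=30, criterion="entropy"
    ),
    "KNearestNeighbours": KNeighborsClassifier(n_neighbors=6),
    "RandomForest": RandomForestClassifier(
        max_features=0.5, max_depth=15, random_state=1
    ),
    "GradientBooster": GradientBoostingClassifier(max_depth=2, random_state=4),
    "GaussianNB": GaussianNB(),
}

# Fit every model on the same split and record its test accuracy in percent.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = 100 * model.score(X_test, y_test)

for name, acc in accuracy.items():
    print(f"Accuracy of {name}: {acc:.2f}%")
```

Storing the scores in a dictionary keyed by model name avoids keeping a separate name list and accuracy list in the same order by hand.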