# Machine Learning with Logistic Regression

This is the second project in the ML micro-project series. We will work with a fake advertising data set that indicates whether or not a particular internet user clicked on an advertisement.

We will create a logistic regression model that predicts whether or not a user will click on an ad, based on their features. Because the target is binary (clicked or did not click), logistic regression is well suited here.
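As a rough illustration of why logistic regression fits a binary target, the sketch below applies the logistic (sigmoid) function to a few linear scores and thresholds the result at 0.5. The scores are made-up values chosen for illustration, not numbers derived from the advertising data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores for three users (illustrative values only).
scores = np.array([-2.0, 0.0, 3.0])
probabilities = sigmoid(scores)

# Predict a click (1) when the estimated probability reaches 0.5.
predicted_clicks = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_clicks)
```

A fitted model learns the weights that produce these scores from the features; the thresholding step is what turns a probability into a 0 or 1 prediction.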

In [None]:
%pip install seaborn

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


### Data
The data set contains the following features:

- 'Daily Time Spent on Site': consumer time on site in minutes
- 'Age': customer age in years
- 'Area Income': Avg. Income of geographical area of consumer
- 'Daily Internet Usage': Avg. minutes a day consumer is on the internet
- 'Ad Topic Line': Headline of the advertisement
- 'City': City of consumer
- 'Male': Whether or not consumer was male
- 'Country': Country of consumer
- 'Timestamp': Time at which consumer clicked on Ad or closed window
- 'Clicked on Ad': 0 or 1, indicating whether the consumer clicked on the ad

In [3]:
ad_data = pd.read_csv('../data/advertising.csv')

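Before exploring the data, it can help to confirm that the loaded frame has the columns listed above. The sketch below does this on two made-up rows read from an in-memory string, so it runs without the CSV file. The row values, the timestamp format, and the names `sample` and `expected_columns` are assumptions for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the column layout described above (not real data).
sample_csv = io.StringIO(
    "Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,"
    "Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad\n"
    "68.95,35,61833.90,256.09,Example headline A,Exampleville,0,Tunisia,"
    "2016-03-27 00:53:11,0\n"
    "44.10,52,38000.00,120.50,Example headline B,Sampletown,1,Italy,"
    "2016-04-04 01:39:02,1\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["Timestamp"])

expected_columns = [
    "Daily Time Spent on Site", "Age", "Area Income", "Daily Internet Usage",
    "Ad Topic Line", "City", "Male", "Country", "Timestamp", "Clicked on Ad",
]

# Any name listed here is missing from the loaded frame.
missing = [col for col in expected_columns if col not in sample.columns]
print(missing)
print(sample.dtypes)
```

The same check can be pointed at the real frame once the file path resolves.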

In [4]:
ad_data.head()


In [5]:
ad_data.info()


In [6]:
ad_data.describe()


### Exploratory Analysis

Checking out the distribution of user age.

In [7]:
plt.hist(ad_data['Age'], bins=30)
plt.xlabel('Age')
plt.show()


Checking out the relationship between age and daily time spent on site.

In [8]:
sns.jointplot(x='Age',y='Daily Time Spent on Site',data=ad_data)


And the relationship between daily time spent on site and daily internet usage.

In [9]:
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data)


Finally, a pairplot to visualise the remaining relationships, colored by whether or not the user clicked the ad.

In [10]:
sns.pairplot(ad_data,hue='Clicked on Ad')


### Model Building

We'll split the data into training and testing sets using train_test_split, but first, let's convert the 'Country' feature into a form the model can accept.

In [11]:
ad_data.columns


As we can't directly use the 'Country' feature (because it's a categorical string), we have to find another way to feed it into the model.

One way to go about this is to drop the feature, but we risk losing useful information.

Instead, we can convert the categorical feature into dummy variables using pandas.

In [12]:
countries = pd.get_dummies(ad_data['Country'],drop_first=True)

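To see what dummy encoding produces, here is a small sketch on a toy column. The frame `toy` and its country values are made up for illustration. With `drop_first=True`, the first category in sorted order is dropped and acts as the baseline, which avoids a redundant column.

```python
import pandas as pd

# Toy column with three distinct countries (illustrative values only).
toy = pd.DataFrame({"Country": ["Italy", "Peru", "Italy", "Qatar"]})

# One indicator column per category, minus the first one in sorted order.
dummies = pd.get_dummies(toy["Country"], drop_first=True)
print(dummies)
```

Rows for 'Italy' have no indicator set in either remaining column, so the baseline category is still recoverable.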

Next, we concatenate the dummy variables with the original data set and drop the original 'Country' column along with the other text and timestamp features.

In [13]:
ad_data = pd.concat([ad_data,countries],axis=1)


In [14]:
ad_data.drop(['Country','Ad Topic Line','City','Timestamp'],axis=1,inplace=True)


Separating the features from the target, then splitting the data set into training and test sets.

In [15]:
X = ad_data.drop('Clicked on Ad',axis=1)
y = ad_data['Clicked on Ad']


In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

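The effect of `test_size=0.3` is easy to check on a toy array. The sketch below uses ten made-up rows, with `X_demo` and `y_demo` as hypothetical stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, and alternating 0/1 labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# 30% of 10 rows go to the test set; the remaining 70% are used for training.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.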

Training the model.

In [18]:
from sklearn.linear_model import LogisticRegression

In [19]:
logclf = LogisticRegression()

In [20]:
# Fit on the training split only, so the test set stays unseen for evaluation
logclf.fit(X_train, y_train)

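A self-contained sketch of the fit step on synthetic data, assuming scikit-learn's `make_classification` as a stand-in for the advertising features. The `max_iter=1000` setting is an assumption added here to give the solver room to converge; it is not part of the original cells.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in, not the ad data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# Fit on the training rows only; the test rows are held back for evaluation.
demo_clf = LogisticRegression(max_iter=1000)
demo_clf.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = demo_clf.predict_proba(X_te)
test_accuracy = demo_clf.score(X_te, y_te)
print(proba[:3])
print(test_accuracy)
```

Fitting on the full data and then scoring on a subset of it would overstate performance, because the model has already seen those rows.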

### Predictions and Evaluations

In [21]:
predictions = logclf.predict(X_test)


Classification report for the model:

In [22]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

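To make the report's numbers concrete, the sketch below computes a confusion matrix and per-class precision and recall on a handful of hand-written labels. The lists `y_true` and `y_pred` are made up for illustration and are not model output.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels: four true non-clicks (0) and four true clicks (1).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# output_dict=True returns the same figures as the printed report, as a dict.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"]["precision"], report["1"]["recall"])
```

Here three of the four predicted clicks are correct (precision 0.75) and three of the four actual clicks are found (recall 0.75).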