##### The challenge will be tackled in the following steps
1- load the relevant libraries and the data from the csv file
2- clean the data
3- decide what to analyze, i.e. identify the input features and the output label
4- create the model
5- train the model with the current data
6- find the accuracy of the model
7- make output predictions based on arbitrary inputs

All the libraries below are required for the challenge. Each of them can be installed as follows:

$ python3 -m pip install <package name>

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

The data is loaded after copying the csv file into the same directory as the Jupyter notebook

In [2]:
df = pd.read_csv('DataScienceChallenge.csv')
df

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.90,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.50,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0
...,...,...,...,...,...,...,...,...,...,...
995,72.97,30,71384.57,208.58,Fundamental modular algorithm,Duffystad,1,Lebanon,2016-02-11 21:49:00,1
996,51.3,45,67782.17,134.42,Grass-roots cohesive monitoring,New Darlene,1,Bosnia and Herzegovina,2016-04-22 02:07:01,1
997,51.63,51,42415.72,120.37,Expanded intangible solution,South Jessica,1,Mongolia,2016-02-01 17:24:57,1
998,55.55,19,41920.79,187.95,Proactive bandwidth-monitored policy,West Steven,0,Guatemala,2016-03-24 02:35:54,0


##### In order to clean the data, I first have to find the type (dtype) of each variable

In [3]:
df.dtypes

Daily Time Spent on Site     object
Age                           int64
Area Income                 float64
Daily Internet Usage        float64
Ad Topic Line                object
City                         object
Male                          int64
Country                      object
Timestamp                    object
Clicked on Ad                object
dtype: object

 "Daily Time Spent on Site" should be float 
 "Timestamp" should be datetime 
 "Clicked on Ad" should be integer 

In [4]:
# errors='coerce' turns any value that cannot be parsed as a number into NaN
df["Daily Time Spent on Site"] = pd.to_numeric(df["Daily Time Spent on Site"], errors='coerce')
# NaN is a float value, not the string 'NaN', so fillna(0) is what replaces it
df["Daily Time Spent on Site"] = df["Daily Time Spent on Site"].fillna(0)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
# .convert_dtypes() converts the column to the best possible dtype, here the nullable integer Int64
df["Clicked on Ad"] = pd.to_numeric(df["Clicked on Ad"], errors='coerce').convert_dtypes()
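The behaviour of `errors='coerce'` and of the filling step is easy to check on a few values. The following is a minimal sketch on made-up toy values (not rows from the csv file); the variable names are only for illustration.

```python
import pandas as pd

# Toy column with one unparseable entry (hypothetical data, not from the csv file).
raw = pd.Series(["68.95", "not a number", "80.23"])

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# NaN is a float value, not the string 'NaN', so replacing the string changes nothing.
unchanged = numeric.replace("NaN", 0)

# fillna(0) is the call that actually substitutes the missing values.
filled = numeric.fillna(0)

# A 0/1 column with a bad entry becomes float after coercion;
# convert_dtypes() then picks the nullable integer dtype Int64.
clicked = pd.to_numeric(pd.Series(["1", "0", "x"]), errors="coerce").convert_dtypes()

print(numeric.isna().sum(), unchanged.isna().sum(), filled.isna().sum())
print(clicked.dtype)
```

The nullable `Int64` dtype (capital I) can hold missing values, which is why the column can be an integer column even before the missing entries are filled.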

#### Checking that all variables have the best possible dtypes

In [5]:
df.dtypes

Daily Time Spent on Site           float64
Age                                  int64
Area Income                        float64
Daily Internet Usage               float64
Ad Topic Line                       object
City                                object
Male                                 int64
Country                             object
Timestamp                   datetime64[ns]
Clicked on Ad                        Int64
dtype: object

Some rows have missing values (NaN). Hence, filling the df with 0 in place of the missing values solves the issue

In [6]:
df2 = df.fillna(0)
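Before filling, it can help to see how many values are actually missing in each column. The sketch below uses a hypothetical three-row frame, not the real data; the `dropna` variant is an alternative I am adding for comparison, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kinds of gaps as the real data.
toy = pd.DataFrame({
    "Daily Time Spent on Site": [68.95, np.nan, 69.47],
    "Age": [35, 31, 26],
    "Clicked on Ad": pd.array([0, 1, None], dtype="Int64"),
})

# Count the missing entries per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option used in this notebook: replace every missing entry with 0.
filled = toy.fillna(0)

# Alternative (an assumption, not the notebook's choice): drop rows whose label
# is missing, so an unknown label is not counted as "did not click".
labelled_only = toy.dropna(subset=["Clicked on Ad"])

print(len(filled), len(labelled_only))
```

Filling with 0 keeps all rows, at the cost of treating a missing label as 0; dropping keeps only rows with a known label.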

#### Cleaning the data is done! Now it is time to choose the variables used as input features to predict the label variable "Clicked on Ad"
I have chosen the numeric variables because I would like to work with a multivariable logistic regression model, which is a classifier (the label is 0 or 1) rather than a linear regression

In [7]:
x = df2.drop(columns = ["Ad Topic Line", "City", "Country", "Timestamp", "Clicked on Ad"])
x

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male
0,68.95,35,61833.90,256.09,0
1,80.23,31,68441.85,193.77,1
2,69.47,26,59785.94,236.50,0
3,74.15,29,54806.18,245.89,1
4,68.37,35,73889.99,225.58,0
...,...,...,...,...,...
995,72.97,30,71384.57,208.58,1
996,51.30,45,67782.17,134.42,1
997,51.63,51,42415.72,120.37,1
998,55.55,19,41920.79,187.95,0


In [8]:
y = df2["Clicked on Ad"].astype('int') #astype('int') to ensure the output is numeric 
y

0      0
1      0
2      0
3      0
4      0
      ..
995    1
996    1
997    1
998    0
999    1
Name: Clicked on Ad, Length: 1000, dtype: int64

###### Splitting the data into training and test sets

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
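Called without extra arguments, `train_test_split` shuffles the rows randomly and holds out 25% of them for testing, so the scores change from run to run. A minimal sketch on made-up data, assuming one wants a reproducible split with the same share of clicks in both parts (`random_state` and `stratify` are real parameters; the demo data and variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for x and y: 100 rows with balanced 0/1 labels.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({"Age": rng.integers(18, 60, size=100),
                       "Male": rng.integers(0, 2, size=100)})
y_demo = pd.Series([0, 1] * 50, name="Clicked on Ad")

# With no test_size given, 25% of the rows are held out for testing.
# random_state fixes the shuffle; stratify keeps the 0/1 ratio similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, random_state=42, stratify=y_demo
)

print(len(x_tr), len(x_te))
print(y_te.mean())
```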

#### Initiating the model
The optimization algorithm used is the default solver, "lbfgs".
I also tried "newton-cholesky", which did not work with multinomial, and "sag", which required a larger max_iter and resulted in a lower model score and accuracy score (~0.72)

In [10]:
model = LogisticRegression(max_iter = 1000, multi_class = "multinomial")
model.fit(x_train, y_train)
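The features live on very different scales (income in the tens of thousands, "Male" as 0 or 1), which is one common reason a solver needs a large max_iter. The following is only a sketch of one possible remedy, standardizing the features in a pipeline; it runs on synthetic data with a made-up labelling rule, and it is not the model trained in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the challenge csv): two features on very different scales.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "Area Income": rng.normal(55000, 13000, size=n),
    "Daily Internet Usage": rng.normal(180, 44, size=n),
})
# Hypothetical rule for the labels: lighter internet users click more often.
labels = (demo["Daily Internet Usage"] + rng.normal(0, 20, size=n) < 180).astype(int)

# Each feature is standardized to zero mean and unit variance before the
# logistic regression, which usually helps lbfgs converge in fewer iterations.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(demo, labels)

print(scaled_model.score(demo, labels))
print(scaled_model[-1].n_iter_)  # iterations the solver actually used
```

Because the scaler is part of the pipeline, new inputs passed to `predict` are scaled the same way automatically.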

#### To test how good the model is 
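I calculated the model score and the accuracy score. A score closer to 100% means the model predicts the labels more accurately. Because the score is measured on test data that the model did not see during training, it estimates how well the model predicts the label for new input features. A score below 100% does not by itself rule out overfitting; to check for it, the test score can be compared with the score on the training data, and a training score far above the test score would suggest overfitting.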

#### To keep in mind:
score() is the accuracy score, i.e. # of correct predictions / # of all predictions. The score() function makes the predictions for x_test internally, without the need to call predict(). On the other hand, accuracy_score needs predict() to give yhat (the predicted labels) based on x_test, and takes the true labels first and the predictions second. Both give the same value for the model's prediction accuracy.

In [11]:
score = model.score(x_test, y_test)
yhat = model.predict(x_test)
acc = accuracy_score(y_test, yhat)  # true labels first, predictions second
print(score, acc)

0.82 0.82
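The equality of the two scores can be verified by computing the accuracy by hand as well. The sketch below uses synthetic data from `make_classification` as a stand-in for the ad data, so the printed numbers will not match the 0.82 above; it also computes the training score for comparison with the test score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the ad data set.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

test_score = clf.score(X_te, y_te)        # predicts internally, then compares
test_acc = accuracy_score(y_te, y_pred)   # arguments: (true labels, predictions)
manual = np.mean(y_pred == y_te)          # correct predictions / all predictions
train_score = clf.score(X_tr, y_tr)       # a much higher value than test_score would hint at overfitting

print(test_score, test_acc, manual, train_score)
```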


This is just an example to show that the model works. I chose arbitrary values for the five input features and asked the model to predict the label "Clicked on Ad"

In [12]:
new_input = pd.DataFrame({'Daily Time Spent on Site': [70.5], 'Age': [25], 'Area Income': [60172.5],
                          'Daily Internet Usage': [198.3], 'Male': [0]})
new_output = model.predict(new_input)
new_output

array([0])
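Besides the predicted label, a logistic regression can also report how confident it is through `predict_proba`. The following is a minimal sketch on randomly generated data with a made-up labelling rule; `demo_model` and `visitor` are hypothetical names and the printed probabilities say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Randomly generated stand-in data with the same five feature columns.
rng = np.random.default_rng(1)
n = 300
demo_x = pd.DataFrame({
    "Daily Time Spent on Site": rng.uniform(30, 90, size=n),
    "Age": rng.integers(19, 61, size=n),
    "Area Income": rng.uniform(15000, 80000, size=n),
    "Daily Internet Usage": rng.uniform(100, 270, size=n),
    "Male": rng.integers(0, 2, size=n),
})
# Hypothetical labelling rule, for illustration only.
demo_y = (demo_x["Daily Time Spent on Site"] < 60).astype(int)

demo_model = LogisticRegression(max_iter=5000).fit(demo_x, demo_y)

# One new visitor, with the columns in the same order as during training.
visitor = pd.DataFrame({"Daily Time Spent on Site": [70.5], "Age": [25],
                        "Area Income": [60172.5], "Daily Internet Usage": [198.3],
                        "Male": [0]})

proba = demo_model.predict_proba(visitor)  # one row: [P(label 0), P(label 1)]
print(demo_model.classes_)
print(proba)
```

The columns of `proba` follow the order of `classes_`, and each row sums to 1.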

The end! Hope you enjoy it :) 