<a href="https://colab.research.google.com/github/Denis060/NSDC-Data-Science-Project-Ad-Targeting/blob/main/Ad_Targeting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    Project: Ad Targeting
</h2>

<h3 align="center">
    Name: (insert your name here)
</h3>

**Description:** This project aims to analyze advertising strategies to understand, and potentially improve, engagement. We'll walk through each step of the data science process, from problem definition to insights and data-driven recommendations.



---



### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

In order to work within the Google Colab Notebook, **please start by clicking on "File" and then "Save a copy in Drive."** This will save a copy of the notebook in your personal Google Drive.

Please rename the file to "Ad Targeting - Your Full Name." Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders.

You can now start working on the project. :)

For a comprehensive introduction to the environment, see Colab's official guide [here](https://colab.research.google.com/github/prites18/NoteNote/blob/master/Welcome_To_Colaboratory.ipynb).

Colab QuickStart:
- Notebooks are made up of cells, which can be either text or code cells. Click the +code or +text button at the top to create a new cell.
- Text cells use a format called [Markdown](https://www.markdownguide.org/getting-started/). A cheat sheet is available [here](https://www.markdownguide.org/cheat-sheet/).
- Python code is run/executed in code cells. You can click the play button at the top left of a code block (sometimes hidden in the square brackets) to run the code in that cell. You can also hit shift+enter to run the cell that is currently selected. There is no concurrency, since cells run one at a time, but you can queue up multiple cells.
- Each cell runs its code individually, but memory is shared across a notebook Runtime. You can think of a Runtime as a code session where everything you create and execute is temporarily stored. This means variables and functions defined in one cell are available in any cell you run afterwards (what matters is the order of execution, not the position of the cells on the page). This also means that if you delete or rename something and re-execute the cell, the old object may still exist in the background. If things aren't making sense, you can always click Runtime -> restart runtime to start over.
- Runtimes persist for a short period of time, so you are safe if you lose connection or refresh the page, but Google will shut down a runtime after enough time has passed. Everything that was printed out will remain on the page even if the runtime is disconnected.
- Google's Runtimes come preinstalled with Python's standard library (math, random, time, etc.) as well as common data analysis libraries (numpy, pandas, scikit-learn, matplotlib). Simply run `import numpy as np` in a code cell to make it available.

### **Defining the Problem:**

We need to clearly articulate the problem we're trying to solve with our ad analysis. Our goal is to determine the effectiveness of different ad features in driving engagement.

Understanding the problem is crucial in any data science project. By defining our objective, we can focus our analysis and ensure that our results are relevant and actionable. In this case, we're looking to understand and improve engagement through targeted advertising.

# **Introduction**
---

In this project, we are working with advertising data from a marketing agency to predict whether a user will click on an advertisement. In the current digital age, online advertising is a major driver of revenue for businesses, but understanding user engagement with ads remains a challenge. This project is particularly relevant in the context of digital marketing, where companies aim to optimize their ad strategies to increase engagement and conversion rates. By applying data science techniques, we can uncover patterns in user behavior that help marketers make data-driven decisions.

Throughout the project, participants will engage in several key tasks: data visualization to explore relationships between variables, data preprocessing (such as feature encoding and scaling), and predictive modeling using logistic regression. They will also analyze model performance using metrics like accuracy, precision, recall, and F1-score. The project is designed for individuals with an intermediate understanding of Python, particularly those familiar with libraries like pandas, seaborn, and scikit-learn. By the end of the project, participants will have developed skills in machine learning model development, feature importance analysis, and model evaluation — key competencies for any aspiring data scientist.

This field is an active area of research in both academia and industry. For those interested in further exploration, studies on [`user behavior prediction in digital marketing`](https://www.researchgate.net/publication/335149938_Predicting_Consumer_Behaviour_in_Digital_Market_A_Machine_Learning_Approach) and research articles on [`click-through rate (CTR) prediction model`](https://paperswithcode.com/task/click-through-rate-prediction#papers-list) can provide deeper insights into advanced methodologies used in this domain.



# **Milestone 1: Data Loading and Preprocessing**
---

**Goal:**
Here, we gather the necessary data for our analysis and ensure it's clean and ready for processing.



Link to the dataset: [Effective Targetting of Advertisments](https://www.kaggle.com/datasets/hiimanshuagarwal/advertising-ef)


The data consists of 10 variables:

`'Daily Time Spent on Site'`, `'Age'`, `'Area Income'`, `'Daily Internet Usage'`, `'Ad Topic Line'`, `'City'`, `'Gender'`, `'Country'`, `'Timestamp'` and `'Clicked on Ad'`.

The main variable we are interested in is `'Clicked on Ad'`. This variable can have two possible outcomes: 0 and 1 where 0 refers to the case where a user didn't click the advertisement, while 1 refers to the scenario where a user clicks the advertisement.

### Data Loading
Let's load the dataset and look at it to better understand which variables we're working with.

In [None]:
# to avoid the warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# mount google drive to colab in order to load the file from drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Begin by loading the dataset using pandas and display the first few rows to get an initial view of the data.

In [None]:
# __TODO__: load the data
df = pd.read_csv(_______)    #enter the path of the csv file

# TODO: print the first five rows of the dataframe df
# code here:

- The `pd.read_csv()` function reads the CSV file and creates a DataFrame, which is a two-dimensional labeled data structure.
- By calling `df.head()`, we're displaying the first five rows of the dataset, giving us a quick glimpse of what our data looks like.
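
As a minimal sketch of these two calls, the example below reads a tiny CSV from an in-memory string instead of a file on Drive. The contents of `csv_text` and the name `toy` are made up for illustration and are not taken from the project dataset.

```python
import io
import pandas as pd

# Hypothetical two-column CSV held in memory (not the project file).
csv_text = "Age,Clicked on Ad\n35,0\n41,1\n29,0\n52,1\n38,0\n47,1\n"

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)      # (number of rows, number of columns)
print(toy.head(3))    # head(n) returns the first n rows; the default is 5
```

In the notebook itself you would pass the path of the CSV file on your mounted Drive rather than a `StringIO` object.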

Use the `info()` method to get a summary of the dataset, including column names, non-null counts, and data types.

In [None]:
# TODO
# code here:

The `df.info()` method provides crucial information about our dataset:
- We have 1009 entries (rows) in our dataset.
- There are 10 columns, each representing different features of our advertising data.
- The data types include float64 (for numerical data with decimals), int64 (for whole numbers), and object (typically used for text or categorical data).
- We can see that some columns have missing values, as the non-null counts are less than the total number of entries.

**Key observations:**
- `'Daily Time Spent on Site'`, `'Age'`, `'Area Income'`, and `'Daily Internet Usage'` are numerical features.
- `'Ad Topic Line'`, `'City'`, `'Gender'`, `'Country'`, and `'Timestamp'` are categorical or text features.
- `'Clicked on Ad'` is our target variable, represented as an integer (likely 0 for no click and 1 for click).
- There are some missing values in `'Age'`, `'Area Income'`, `'City'`, and `'Country'` columns.

In classification problems, it's important to check if the target variable is balanced. A balanced dataset means that each class (in this case, `'Clicked on Ad' = 1` and `'Clicked on Ad' = 0`) has approximately the same number of instances. If a dataset is imbalanced, it can bias the model toward predicting the majority class more often, which can lead to poor performance on the minority class.

Understanding the structure and content of our dataset is crucial for further analysis and preprocessing steps. It helps us identify potential issues like missing values and guides our feature engineering and model selection processes.
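
To make the idea of class balance concrete, here is a small sketch on a made-up label column (6 non-clicks and 4 clicks). The Series `target` is hypothetical; only `value_counts()` itself comes from the hint in the next cell.

```python
import pandas as pd

# Hypothetical labels: 6 non-clicks (0) and 4 clicks (1).
target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="Clicked on Ad")

counts = target.value_counts()                 # absolute count per class
shares = target.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```

Shares close to 0.5 for each class indicate a balanced binary target; a split such as 0.95 versus 0.05 would call for extra care when interpreting accuracy.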

In [None]:
# TODO: checking if we have a balanced dataset
# Code here:


# (hint: use value_counts())

### Handle Missing Values

**Why Check For Missing Values?**

- Identifying missing values is a crucial step in data preprocessing. Missing data can significantly impact our analysis and model performance.
- It gives us an idea of the overall quality of our dataset.
- Columns with missing values might be less reliable for our analysis.
- Knowing which columns have missing values helps us decide on appropriate strategies for handling them. For example:
  - For numerical columns like `'Daily Time Spent on Site'` or `'Age'`, we might consider imputing with mean or median values.
  - For categorical columns like `'City'` or `'Country'`, we might create a new category for missing values or use the most frequent value.

- If certain columns have more missing values than others, it could introduce bias into our analysis. We need to consider why these values are missing and whether the missingness is related to our target variable (`'Clicked on Ad'`).

Learn about central tendency in detail [here](https://www.youtube.com/watch?v=RHI110MHUCc&list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH&index=35&pp=gAQBiAQB).

Use the `isna()` method combined with `sum()` to count the number of missing values in each column.
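
The sketch below shows how the two methods combine on a small made-up DataFrame. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "Age": [35.0, np.nan, 29.0, 52.0],
    "City": ["Lagos", None, "Lima", None],
    "Clicked on Ad": [0, 1, 0, 1],
})

mask = toy.isna()          # same shape as toy, True where a value is missing
per_column = mask.sum()    # summing booleans counts the True values per column

print(per_column)
```

Because `True` counts as 1 when summed, the result is a Series indexed by column name, which is why it can be plotted directly as a bar chart.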

In [None]:
## Check for Missing Values

# Calculate the number of missing values for each column
missing_values = #TODO

print("Number of missing values per column:", missing_values)



---



Now, let's visualize the missing values per column as a bar graph. Learn about bar graphs [here](https://www.google.com/url?q=https://www.youtube.com/watch?v%3DY_HCxHOy4Sw%26list%3DPLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH%26index%3D28%26pp%3DgAQBiAQB&sa=D&source=editors&ust=1733322841503749&usg=AOvVaw01PlFyaksZP69knXmIc-fI).

In [None]:
# Plot the missing values as a bar chart
plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
plt.title('_________')
plt.xlabel('________')
plt.ylabel('________')
plt.xticks(rotation=90)
plt.show()

**Insights:**
- The missing values are relatively small compared to the total dataset size (1009 rows), so we can likely handle them without losing significant information.
- We will need to decide whether to impute these values or drop rows/columns depending on the importance of these features for our analysis.

In the next step, we will explore strategies for handling these missing values, such as imputation or removal.

The following bar graph is nearly identical to the one above, but it also prints the value for each column on top of its bar. Which one would you prefer and why?

> *  Answer here:

In [None]:
# Plot the missing values as a bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(missing_values.index, missing_values.values, color='skyblue')

# Add text annotations on top of each bar
# TODO: code here

plt.title('Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=90)
plt.show()

**Strategy for Handling the Missing Values**

Dropping the rows with missing values is not ideal in this case because the dataset is small, and dropping a row also throws away the valid values in its other columns. Instead, we handle missing values as follows so that this information is kept:

- **Numerical Columns:** Imputing with the mean retains all rows and leaves each column's mean unchanged, although it slightly reduces the column's variance.
- **Categorical Columns:** Filling with 'Unknown' allows us to keep information about rows where location data (`'City'` or `'Country'`) was missing, which could be useful for analysis or modeling.
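
The two strategies can be sketched on a toy frame as follows. The columns `score` and `region` are hypothetical stand-ins, not columns of the ad dataset, and the sketch also prints the mean and standard deviation before and after imputation to show what mean imputation does to a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'score' and 'region' are not columns of the ad dataset.
toy = pd.DataFrame({
    "score": [40.0, np.nan, 60.0, 50.0],
    "region": ["north", None, "south", "north"],
})

mean_before = toy["score"].mean()   # NaN is skipped by default
std_before = toy["score"].std()

# Numerical column: replace NaN with the mean of the observed values.
toy["score"] = toy["score"].fillna(mean_before)

# Categorical column: label missing entries explicitly.
toy["region"] = toy["region"].fillna("Unknown")

print(toy)
print(mean_before, toy["score"].mean())   # the mean is unchanged
print(std_before, toy["score"].std())     # the spread shrinks
```

Assigning the result back to the column, rather than calling `fillna(..., inplace=True)` on a selected column, is the pattern that recent pandas versions recommend.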


In [None]:
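# Fill missing values in 'Age' with the column mean for simplicity.
# Assign the result back: calling fillna(..., inplace=True) on a selected
# column triggers warnings in recent pandas versions and may not update df.
df['Age'] = df['Age'].fillna(df['Age'].mean())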

# TODO: fill the missing values for other numerical columns with the mean
# _____

# TODO: fill missing values in City and Country with 'Unknown'
# ____

After filling in the missing values, check again to ensure that there are no remaining missing values.

In [None]:
# TODO: check missing values again after filling
# _______

### Cleaning the dataset

Looking at the data, we can see that we need to convert the `Timestamp` column into a more usable format and extract additional features such as `Hour`, `Day`, and `Month`.

- The first step is to convert the `Timestamp` column from a `string` format to a proper `datetime` format using `pd.to_datetime()`.
- After converting the `Timestamp` column to `datetime`, we can extract the desired features.
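
As a rough illustration of the `.dt` accessor, the sketch below parses two made-up timestamp strings and pulls out a few components. It deliberately uses attributes other than the ones left blank in the next cell; the Series `stamps` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical timestamp strings in a common 'YYYY-MM-DD HH:MM:SS' layout.
stamps = pd.Series(["2016-03-27 00:53:11", "2016-07-04 17:05:42"])

parsed = pd.to_datetime(stamps)   # strings -> datetime64 values

years = parsed.dt.year.tolist()
minutes = parsed.dt.minute.tolist()
day_names = parsed.dt.day_name().tolist()

print(parsed.dtype)
print(years, minutes, day_names)
```

The `.dt` accessor is only available once the column has a datetime dtype, which is why the conversion comes first.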

In [None]:
# Convert 'Timestamp' to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Extract additional features from 'Timestamp'
df['Hour'] = df['_______'].dt.hour
df['Day'] = df['Timestamp'].dt.___
df['_____'] = df['Timestamp'].dt.month

In [None]:
# print head to see the new features
df.head()

With this step completed, our dataset is now clean and ready for further analysis!

# **Milestone 2: Exploratory Data Visualization**
---

It's time to explore the data visually. Visualizing the data helps us understand patterns, distributions, and relationships between variables. Here is a [data visualization resource](https://www.youtube.com/playlist?list=PLNs9ZO9jGtUCZ0pzj1OcFN450-COhaWkw).

### Age Distribution Histogram
Plot a histogram of the `'Age'` column to understand its distribution.


This visualization is a histogram of the `'Age'` column, which shows how age is distributed across our dataset. Use the `sns.histplot()` function from Seaborn here, with `bins=20` to divide the age range into 20 intervals, and `kde=True` to overlay a Kernel Density Estimate (KDE) curve that smooths out the distribution. Learn more about KDE [here](https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/).

In [None]:
# Age Distribution Histogram
plt.figure(figsize=(__, __))
sns.__
plt.title(__)
plt.xlabel(__)
plt.ylabel(__)
plt.show()

__TODO__: Write your key insights from the above distribution
<br>
**Key Insights:**
> *  
> *  

These insights can help us understand which age groups are most represented in our dataset, which could be important when analyzing ad engagement.

### Age vs. Clicked on Ad
Use a box plot to visualize how age relates to whether a user clicked on an ad or not. Use `sns.boxplot()`

In [None]:
# Age vs. Clicked on Ad
plt.figure(figsize=(_, _))
sns.___
plt.title(___)
plt.xlabel(__)
plt.ylabel(__)
plt.show()

The box plot above compares the age distribution between those who clicked on ads and those who didn't.
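
If it helps when reading the plot, the numbers a box plot draws (quartiles and median per group) can also be computed directly. The sketch below does this on made-up ages; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical ages for five non-clickers (0) and five clickers (1).
toy = pd.DataFrame({
    "Age": [22, 25, 28, 31, 34, 38, 42, 45, 50, 55],
    "Clicked on Ad": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Quartiles per group: these are the box edges and the line inside the box.
summary = toy.groupby("Clicked on Ad")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["Q1", "median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]   # height of the box

print(summary)
```

Comparing the medians and the interquartile ranges of the two groups is a quick numeric check on what the boxes suggest visually.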

**TO DO:** What differences do you observe? Could age be a factor in ad engagement?

> * Answer here:  

Now do the same for Area Income. Create a boxplot for Area Income vs Clicked on Ad

In [None]:
# Area Income vs. Clicked on Ad

# TODO: code here

**TO DO:** What insights do you get from these boxplots, and how could they affect ad targeting decisions?

>* Answer here:

### Violin Plot for Age and Clicked on Ad
Create a violin plot to visualize the distribution of age for users who clicked on ads versus those who did not. Use `sns.violinplot()`.

The violin plot provides a more detailed view of the distribution of the `'Age'` variable for each category of `'Clicked on Ad'` (`0` = did not click, `1` = clicked). <br><br>
Learn more about violin plots [here](https://www.geeksforgeeks.org/violinplot-using-seaborn-in-python/).

In [None]:
# Violin Plot for Age and Clicked on Ad

# TODO
sns.___(___, inner='quart')
# TODO


The violin plot provides a detailed view of the age distribution for each group.

**TO DO:** Do you notice any interesting or repeating patterns or outliers?
> * Answer here:

**TO DO:** Is a violin plot more or less detailed than a box plot? Explain.

> * Answer here:


### Average Daily Time Spent on Site by Clicked on Ad
Create a bar plot to show the average daily time spent on the site for users who clicked on an ad (1) versus those who did not (0).

In [None]:
# Average Daily Time Spent on Site by Clicked on Ad
# This chart will show the average daily time spent on the site by whether an ad was clicked.

# TODO: code here


# (hint: use sns.barplot())

**TO DO:** <br>
> * Share your key insights here from this bar chart.

### Stacked Bar Chart of Clicked on Ad by Country

In [None]:
# Stacked Bar Chart of Clicked on Ad by Country
# This chart shows the count of ad clicks by country.

# rows = countries, columns = click outcome (0 / 1), values = number of users
country_counts = df.groupby(['Country', 'Clicked on Ad']).size().unstack(fill_value=0)
country_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='viridis')
plt.title('Stacked Bar Chart of Clicked on Ad by Country')
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

**Description:** This stacked bar chart shows the distribution of ad clicks (1 = clicked, 0 = not clicked) across different countries. Each bar represents a country, and the segments within the bars represent the count of users who clicked or did not click on an ad.

**Key Insights:**
- Ad engagement varies across countries, with some countries showing higher proportions of clicks (yellow segments) compared to others.
- The United States has a significantly higher number of users, with both clicked and non-clicked ads being prominent.
- However, the **x-axis labels are crowded, making it difficult to read** individual country names. You may want to rotate the labels or filter out countries with fewer data points for better clarity.

**TO DO:** How would you go about making it a presentable and useful graph?

> * Answer here:

In [None]:
# OPTIONAL TODO: Recreate the graph here to make it more useful!
# Code here:


### Faceted Histogram of Age by Clicked on Ad

In [None]:
# Faceted Histogram of Age by Clicked on Ad
# Faceted histograms allow you to compare distributions across different subsets of data.

# To identify potential bias in the 'Age' column of your dataset:

g = sns.FacetGrid(df, col="Clicked on Ad", height=5, aspect=1)
g.map(sns.histplot, "Age", bins=20, kde=True)
g.set_axis_labels("Age", "Count")
g.set_titles("Clicked on Ad = {col_name}")
g.fig.suptitle('Faceted Histogram of Age by Clicked on Ad', y=1.05)
plt.show()

**Description:** <br>
This faceted histogram shows the distribution of age for users who clicked on an ad (right) versus those who did not (left), with a KDE curve overlay.

**TO DO:**
>* Share your key insights here

Do the same Faceted Histogram of `'Area Income'` by `'Clicked on Ad'`.

In [None]:
# TODO: code here
#

**TO DO:** <br>
> *  Share your key insights here

### Correlation Matrix

In [None]:
# Correlation Matrix

# get the numerical columns from the dataset
num_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

plt.figure(figsize=(10, 8))
correlation_matrix = df[num_cols].___     # use corr() for getting the correlation matrix

sns.heatmap(____, annot=True, fmt="___", cmap='coolwarm', square=True)    # try annot=False and see what difference it makes
plt.title('Correlation Matrix')
plt.show()

**Description:**  <br>
This heatmap shows the correlation between numerical variables in the dataset, with values ranging from -1 to 1. Positive correlations are shown in red, and negative correlations are shown in blue.
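
As a small sketch of what the matrix contains, the example below computes Pearson correlations on a made-up frame. The column names `minutes_on_site`, `pages_viewed`, and `ads_skipped` are hypothetical and chosen so that one pair rises together and one pair moves in opposite directions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the ad dataset.
toy = pd.DataFrame({
    "minutes_on_site": [30, 45, 60, 75, 90],
    "pages_viewed":    [3, 5, 6, 8, 9],   # rises with minutes_on_site
    "ads_skipped":     [9, 7, 6, 4, 3],   # falls as minutes_on_site rises
})

corr = toy.corr()   # Pearson correlation by default

print(corr.round(2))
```

The diagonal is always 1 (each variable against itself) and the matrix is symmetric, so only one triangle carries new information.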

**TO DO:** Share 2 key insights here:
> *  
> *  

### Distribution of Daily Internet Usage by Country

In [None]:
# Distribution of Daily Internet Usage by Country

# get the top 10 countries by occurrences
top_countries = df['Country'].value_counts().nlargest(10).index

plt.figure(figsize=(12, 6))
plt.boxplot([df[df['Country'] == country]['Daily Internet Usage'] for country in top_countries], labels=top_countries)
plt.xlabel('Country')
plt.ylabel('Daily Internet Usage')
plt.title('Distribution of Daily Internet Usage by Country')
____ = plt.xticks(rotation=45, ha='right')

**Description:** <br>
The box plot shows the distribution of daily internet usage across the top 10 countries with the most data entries in the dataset. Each box represents a country, and the distribution of daily internet usage is visualized through the median (orange line), quartiles, and potential outliers.

**Which of these insights do you agree with?**
- Countries like Afghanistan and Turkey have a narrower range of daily internet usage, indicating less variation among users in these countries.
- Countries like Peru and France have a wider range, with some users having significantly higher daily internet usage.
- There are a few outliers in countries like Australia and Czech Republic, where some users have notably lower or higher internet usage compared to the majority.

**TO DO:**
> *  Write your answer here and explain why.

### Histogram for numerical features

In [None]:
# Histogram for numerical features

sns.set(style="whitegrid")

num_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

plt.figure(figsize=(15, 10))
for i, col in enumerate(num_cols):
    plt.subplot(3, 2, i + 1)
    sns.histplot(df[col].dropna(), bins=10, kde=True)
    plt.title(f'Histogram of {col}')
plt.tight_layout()
plt.show()

**Description:** <br>
These histograms display the distributions of various numerical features in the dataset, such as `'Daily Time Spent on Site'`, `'Age'`, `'Area Income'`, `'Daily Internet Usage'`, and `'Clicked on Ad'`. The KDE (Kernel Density Estimate) curves overlay the histograms to provide a smoother view of the distributions. Learn about skewness in histograms [here](https://www.geeksforgeeks.org/skewness-measures-and-interpretation/).

**TODO:** Fill in the following blanks:

- **Daily Time Spent on Site:** The distribution is slightly `_______`-skewed, with most users spending between `_____` to `____` minutes on the site.
- **Age:** The age distribution is `_____`-skewed, with most users falling between `____` to `___` years old.
- **Area Income:** The income distribution is bell-shaped, with most users earning between USD `_____` and USD `____` annually.
- **Daily Internet Usage:** This feature shows a `______` distribution, with peaks around 125 minutes and 225 minutes.
- **Clicked on Ad:** This is a binary variable, with almost equal counts for 0 (did not click) and 1 (clicked), confirming that the dataset is `_________`.


### Daily Time Spent on Site vs. Daily Internet Usage

In [None]:
# Daily Time Spent on Site vs. Daily Internet Usage

plt.figure(figsize=(10, 6))
plt.scatter(x = ___, y = ___, c = ___, cmap='viridis', alpha=0.7)   # TODO: fill the appropriate features for x, y, c
plt.xlabel('Daily Time Spent on Site')
plt.ylabel('Daily Internet Usage')
plt.title('Daily Time Spent on Site vs. Daily Internet Usage')
_ = plt.colorbar(label='Clicked on Ad')

**Description:** <br>
This scatter plot visualizes the relationship between `Daily Time Spent on Site` and `Daily Internet Usage`, with points color-coded based on whether the user clicked on an ad (0 = did not click, 1 = clicked).

**TO DO:** Write your key insights below.
> *  
> *   

**TO DO:** Give one or more actionable insights from the above visualization
> *  
> *   
> *   

# **Milestone 3: Predictive Modeling**
---
In this step, we will
- encode categorical variables
- select relevant features
- split the data into train/test
- scale the data
- train and test the model
- evaluate the model performance.


In [None]:
# import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Copy, Select, Split & Scale

Before proceeding with any transformations, it's a good practice to create a copy of the original dataset, using `df.copy()`. This ensures that any changes (like one-hot encoding or feature extraction) won't affect the original data.

In [None]:
# creating a copy of the dataset so we don't change the original df structure in one-hot-encoding process
df2 = df.____

Categorical variables such as `'City'` and `'Country'` need to be converted into numerical representations for machine learning models. We use one-hot encoding to achieve this, which creates binary columns for each category. Learn more about One-Hot Encoding [here](https://www.geeksforgeeks.org/ml-one-hot-encoding/).

This process converts each unique value in a categorical column into a new binary column. For example, if there are three cities (New York, Los Angeles, Chicago), three new columns will be created (City_New York, City_Los Angeles, City_Chicago), with 1 indicating the presence of that city and 0 otherwise.

`drop_first=True`: This argument prevents multicollinearity by dropping one category from each set of dummy variables.
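
The three-city example above can be sketched directly. The frame `toy` below is made up for illustration; it contrasts the default encoding with `drop_first=True`.

```python
import pandas as pd

# Hypothetical frame with the three cities from the example above.
toy = pd.DataFrame({"City": ["New York", "Los Angeles", "Chicago", "New York"]})

# One binary column per city.
full = pd.get_dummies(toy, columns=["City"])

# Same encoding, but the first category (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=["City"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With `drop_first=True`, a row whose remaining dummy columns are all 0 belongs to the dropped category, so no information is lost.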

In [None]:
# Encode categorical variables

# TODO
df2 = pd.get_dummies(df2, columns=['__column1__', '__column2__'], drop_first= ___)

Next, we **select the features** that will be used in our predictive model.

And finally, we prepare the feature matrix **`X`** and target vector **`y`**.

In [None]:
# Select features for modeling
# The dummy columns exist only in the encoded copy df2, so look them up there (not in df)
features = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Hour', 'Day', 'Month'] + [col for col in df2.columns if col.startswith(('City_', 'Country_'))]

X = df2[features]
y = df2['Clicked on Ad']

**Splitting the data:** <br>
We use `train_test_split()` from the `sklearn.model_selection` module to split the data into training and testing sets. The `test_size` parameter should be set to `0.2`, meaning 20% of the data will be used for testing, while 80% will be used for training. A `random_state` is set to ensure reproducibility.
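
A minimal sketch of the split on made-up arrays is shown below. The arrays `X_toy` and `y_toy` are hypothetical; with 10 samples and `test_size=0.2`, 2 samples go to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and alternating labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Without a fixed `random_state`, each run would shuffle the rows differently and the evaluation numbers would change from run to run.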

In [None]:
# TODO
# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=___, random_state=42)

**Scaling the data:** <br>
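Next, we scale the features using `StandardScaler` from `sklearn.preprocessing`. Standardization rescales the data so that each feature has a mean of `0` and a standard deviation of `1`. This matters for models that are sensitive to feature scale: distance-based models such as k-nearest neighbors, and regularized models such as logistic regression (scikit-learn's `LogisticRegression` applies L2 regularization by default, so features on very different scales would be penalized unevenly and the solver may converge slowly).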

Learn more about standardization [here](https://www.geeksforgeeks.org/what-is-standardization-in-machine-learning/).
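
The sketch below illustrates the fit-on-train, transform-on-test pattern on made-up arrays. The names `train` and `test` and their values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
train = np.array([[10.0, 1000.0], [20.0, 2000.0], [30.0, 3000.0]])
test = np.array([[40.0, 4000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from the training data only
test_scaled = scaler.transform(test)         # reuses the training mean and std

print(train_scaled.mean(axis=0))   # approximately 0 for each feature
print(train_scaled.std(axis=0))    # approximately 1 for each feature
print(test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into preprocessing, which is why the test data is only transformed.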

In [None]:
# TODO: Scale the features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(_____)
X_test_scaled = scaler.transform(_____)

Now our data is ready for training and testing.

### Train and Test

Now, we use the `LogisticRegression` model to fit on our train data, and then make predictions on the test data using `model.predict`.
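
For orientation, here is a sketch of the fit and predict workflow on synthetic data generated with `make_classification`. The variable names (`clf`, `labels`, `probs`) and the synthetic dataset are assumptions and have nothing to do with the ad data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

clf = LogisticRegression(random_state=0)
clf.fit(X_tr, y_tr)                      # learn coefficients from the training split

labels = clf.predict(X_te)               # hard 0/1 predictions
probs = clf.predict_proba(X_te)[:, 1]    # estimated probability of class 1

print(labels[:5])
print(probs[:5].round(2))
```

`predict` returns class labels, while `predict_proba` exposes the underlying probabilities; for a binary problem a label of 1 corresponds to a class-1 probability above 0.5.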

In [None]:
# Logistic Regression Model

# TODO: Train logistic regression model
lr_model = _______(random_state=42)
lr_model.fit(______, ______)      # fill the parameters carefully

# TODO: Make predictions
y_pred = lr_model._______(_______)    # fill the parameters carefully

### Evaluation

In [None]:
# Evaluate the model

# TODO: evaluate accuracy score
print("Accuracy:")
______(____, ____)    # fill the parameters carefully

In [None]:
# TODO: get confusion matrix

print("Confusion Matrix:")
______(____, ____)

In [None]:
# TODO: get classification report

print("Classification Report:")
______(____, ____)
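
Before interpreting your own results, it may help to see how the metrics relate to each other on a tiny hand-made example. The label lists `y_true` and `y_hat` below are hypothetical and are not output from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for 10 users.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_hat)   # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()            # order for binary labels 0/1

acc = accuracy_score(y_true, y_hat)    # (tp + tn) / total
prec = precision_score(y_true, y_hat)  # tp / (tp + fp)
rec = recall_score(y_true, y_hat)      # tp / (tp + fn)

print(cm)
print(acc, prec, rec)
```

Precision answers "of the users predicted to click, how many did?", while recall answers "of the users who clicked, how many were found?"; the classification report lists both per class along with their harmonic mean, the F1-score.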

**Reading the Evaluation:**

**Accuracy:** <br>
**TODO:** Write here

<br>

**Confusion Matrix:** <br>
**TODO:** Write here


<br>

**Classification Report:** <br>
**TODO:** Write here

<br>

**TODO:** Write conclusion here

### Feature Importance

Based on the model training, we will find out which features hold the most importance in predicting the target variable.

In [None]:
# Feature importance
feature_importance = pd.DataFrame({'Feature': features, 'Importance': abs(lr_model.coef_[0])})
feature_importance = feature_importance.sort_values('Importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance.head(5))
plt.title('Top 5 Most Important Features')
plt.show()

**TODO:** Share insights from the above chart
> *  Answer here

**TODO:** Why did we use logistic regression? Was it a good decision?

> *  Answer here

<h3 align = 'center' >
Thank you for completing the project!
</h3>

Please submit all materials to the NSDC HQ team at er3101@columbia.edu in order to receive a virtual certificate of completion. Do reach out to us if you have any questions or concerns. We are here to help you learn and grow.