# GoGreen increases x% participation in future Marketing campaigns with Machine Learning

## Executive summary


## Background

GoGreen is a company in the solar industry. It has hired us as consultants to help it understand which factors contribute to the success of future marketing campaigns. Currently, the company is trying to reduce marketing efforts that yield zero return on investment (ROI).
We are also tasked with answering the following questions: What does the average customer look like? Which products and revenue channels perform best? Which marketing campaigns were most successful?

## Overview of Approach

We will answer these questions through data visualization, statistical analysis, and several machine learning models leveraging Python.

### Data Dictionary

The following are the fields that we have in our data:


- `ID`: the unique identification code for every customer
- `Year_Birth`: The Year of a customer's birth
- `Education`: The level of education that a customer completed
- `Marital_Status`: Marital status of the customer
- `Income`: Annual Income
- `Kidhome`: # of children under the age of 13 in Customer's household
- `Teenhome`: # of children between 13-19 in Customer's household
- `Dt_Customer`: Date of Customer Enrollment
- `Recency`: # of days since last purchase
- `MntWines`: Dollar amount of Wines purchased in last 2 years
- `MntFruits`: Dollar amount of Fruits purchased in last 2 years
- `MntMeatProducts`: Dollar amount of Meat products purchased in the last 2 years
- `MntFishProducts`: Dollar amount of Fish products purchased in the last 2 years
- `MntSweetProducts`: Dollar amount of Sweet products purchased in the last 2 years
- `MntGoldProds`: Dollar amount of Gold products purchased in the last 2 years
- `NumDealsPurchases`: # of purchases made with discount
- `NumWebPurchases`: # of purchases made through the company's website
- `NumCatalogPurchases`: # of purchases made using the catalog
- `NumStorePurchases`: # of purchases made directly in-store
- `NumWebVisitsMonth`: # of visits to the company's website in the last month
- `AcceptedCmp1`: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- `AcceptedCmp2`: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- `AcceptedCmp3`: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- `AcceptedCmp4`: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- `AcceptedCmp5`: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- `Complain`: 1 if customer complained in the last 2 years, 0 otherwise
- `Response`: 1 if customer accepted the offer in the last campaign, 0 otherwise
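As a rough sketch of how these fields can be combined to compare products and channels, the `Mnt*` columns can be summed per product and the `Num*Purchases` columns per channel. The toy frame below holds three sample rows restricted to a few of the columns listed above; the names `toy`, `product_cols`, and `channel_cols` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data: three sample customers, a subset of the spend and channel columns
toy = pd.DataFrame(
    {
        "MntWines": [635, 11, 426],
        "MntFruits": [88, 1, 49],
        "MntMeatProducts": [546, 6, 127],
        "NumWebPurchases": [8, 1, 8],
        "NumCatalogPurchases": [10, 1, 2],
        "NumStorePurchases": [4, 2, 10],
    }
)

# Group columns by naming convention (hypothetical helper lists)
product_cols = [c for c in toy.columns if c.startswith("Mnt")]
channel_cols = [
    c for c in toy.columns if c.startswith("Num") and c.endswith("Purchases")
]

# Total spend per product and total purchases per channel, largest first
product_totals = toy[product_cols].sum().sort_values(ascending=False)
channel_totals = toy[channel_cols].sum().sort_values(ascending=False)

print(product_totals)
print(channel_totals)
```

On the full dataset the same two sums would rank every product category and sales channel, which is one way to approach the "best performing" question.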


## Let's start coding!

### Import Libraries

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

In [2]:
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [3]:
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

In [4]:
# To suppress scientific notation when displaying numbers
pd.set_option("display.float_format", lambda x: "%.2f" % x)

In [5]:
# Avoid displaying warnings
import warnings

warnings.filterwarnings("ignore")

In [6]:
# Machine Learning Libraries



In [7]:
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black


In [8]:
# loading the dataset
df = pd.read_excel("marketing_campaign.xlsx")


In [9]:
# checking shape of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset")

There are 2240 rows and 29 columns in the dataset



In [10]:
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0



In [32]:
# to view the last 100 rows of the dataset
df.tail(100)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
2140,10432,1974,Graduation,Divorced,19346.0,1,0,2014-01-30,26,2,0,9,3,6,2,1,1,0,3,8,0,0,0,0,0,0,3,11,0
2141,9216,1971,Graduation,Married,35788.0,1,1,2014-01-23,34,23,2,11,3,1,4,2,2,0,3,6,0,0,0,0,0,0,3,11,0
2142,7124,1968,Graduation,Divorced,36997.0,1,1,2013-02-01,72,43,4,12,8,0,27,5,2,1,4,5,0,0,0,0,0,0,3,11,0
2143,9727,1957,Graduation,Married,23539.0,0,0,2014-02-28,13,4,24,11,16,1,25,1,2,0,4,6,0,0,0,0,0,0,3,11,0
2144,5136,1973,Graduation,Single,65333.0,0,1,2014-01-17,58,654,7,92,0,15,30,7,9,4,8,6,0,1,1,0,0,0,3,11,0
2145,9790,1957,Graduation,Single,78499.0,0,0,2013-11-23,12,912,72,170,47,36,97,1,11,3,4,4,0,0,1,0,0,0,3,11,1
2146,1818,1971,PhD,Together,29732.0,1,0,2014-03-25,23,25,0,8,0,1,4,1,2,0,2,9,0,0,0,0,0,0,3,11,0
2147,1100,1960,Master,Together,41275.0,1,2,2014-03-24,33,24,4,22,0,2,9,4,3,1,3,5,0,0,0,0,0,0,3,11,0
2148,7873,1973,PhD,Together,63516.0,1,1,2013-07-06,30,141,11,114,15,14,5,4,4,1,7,5,0,0,0,0,0,0,3,11,0
2149,10609,1962,PhD,Married,42769.0,0,1,2013-10-12,15,71,0,13,3,1,0,2,1,1,4,4,0,0,0,0,0,0,3,11,0



In [12]:
# let's create a copy of the data to avoid any changes to original data
data = df.copy()


In [13]:
# checking for duplicate values in the data
data.duplicated().sum()

0


- There are no duplicate values in the data.
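Because every row carries a unique `ID`, a full-row duplicate check can never flag two customers whose other attributes are identical. A minimal sketch of a stricter check that ignores `ID` is shown below on a toy frame; the toy values are made up for illustration, and nothing here claims what the real dataset would return.

```python
import pandas as pd

# Toy data: rows 0 and 1 differ only in ID
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Education": ["PhD", "PhD", "Master"],
        "Income": [50000.0, 50000.0, 42000.0],
    }
)

# Full-row check: unique IDs make every row distinct
full_row_dupes = toy.duplicated().sum()

# Check on the attributes only, with the identifier removed
dupes_ignoring_id = toy.drop(columns="ID").duplicated().sum()

print(full_row_dupes, dupes_ignoring_id)
```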

In [14]:
# checking the names of the columns in the data
print(data.columns)

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')



In [15]:
# checking column datatypes and number of non-null values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i


* The dependent variable is `Response`, the response of a client to a campaign, which is of *int* type.
* `Education`, `Marital_Status`, and `Dt_Customer` are of *object* type.
* All other columns are numeric in nature.
* There are missing values in the `Income` column.

## Fixing the data types
* `Year_Birth` should be converted to a categorical data type by dividing the column into generations, e.g. Millennials, Baby Boomers, Gen Z.

In [16]:
conditions = [
    (df["Year_Birth"] >= 1893) & (df["Year_Birth"] <= 1923),
    (df["Year_Birth"] >= 1924) & (df["Year_Birth"] <= 1945),
    (df["Year_Birth"] >= 1946) & (df["Year_Birth"] <= 1964),
    (df["Year_Birth"] >= 1965) & (df["Year_Birth"] <= 1980),
    (df["Year_Birth"] >= 1981) & (df["Year_Birth"] <= 1996),
    (df["Year_Birth"] >= 1997) & (df["Year_Birth"] <= 2012),
    (df["Year_Birth"] >= 2013) & (df["Year_Birth"] <= 2025),
]

values = [
    "Error",
    "96-100 years old - Silent",
    "59-95 - Boomer",
    "43-58 - GenX",
    "27-57 - Millenialls",
    "11-26 GenZ",
    "0-10 - GenA",
]


In [17]:
# create new column age_group
data["age_group"] = np.select(conditions, values)


In [18]:
# Let's drop Year_Birth from our dataset
data.drop(columns=["Year_Birth"], inplace=True)


In [19]:
data.head(2)

Unnamed: 0,ID,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,age_group
0,5524,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1,59-95 - Boomer
1,2174,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0,59-95 - Boomer
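The same generation binning can also be sketched with `pd.cut`, which takes the bin edges once instead of one boolean condition per group. The edges below mirror the year ranges used in the `np.select` conditions; the short labels (without age ranges) and the variable names are assumptions made for this example.

```python
import pandas as pd

# Birth years of the first five customers in the dataset preview
year_birth = pd.Series([1957, 1954, 1965, 1984, 1981], name="Year_Birth")

# Right-inclusive edges: (1923, 1945], (1945, 1964], ... match 1924-1945, 1946-1964, ...
bin_edges = [1923, 1945, 1964, 1980, 1996, 2012]
labels = ["Silent", "Boomer", "GenX", "Millennial", "GenZ"]

# Years outside the edges become NaN instead of a silent default value
generation = pd.cut(year_birth, bins=bin_edges, labels=labels)

print(generation.tolist())
```

The result is already of `category` dtype, so no separate `astype("category")` call is needed for this column.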



* `Dt_Customer` should be converted to a categorical data type.

In [58]:
conditions = [
    (df["Dt_Customer"] < "2012-01-01"),
    (df["Dt_Customer"] >= "2012-01-01") & (df["Dt_Customer"] <= "2012-01-31"),
    (df["Dt_Customer"] >= "2013-01-01") & (df["Dt_Customer"] <= "2013-01-31"),
    (df["Dt_Customer"] >= "2014-01-01") & (df["Dt_Customer"] <= "2014-01-31"),
]

values = ["Before 2012", "2012", "2013", "2014"]


In [59]:
# create new column date_customer
data["date_customer"] = np.select(conditions, values)


In [60]:
data["date_customer"].value_counts()

0       2043
2013     107
2014      90
Name: date_customer, dtype: int64
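In the output above, 2,043 rows receive the `np.select` default of 0 because each condition covers only January of its year (for example `2012-01-01` to `2012-01-31`), so any other enrollment date matches no condition. One possible fix, sketched on the five dates from the dataset preview, is to parse the column and bin on the calendar year. The `Before 2012` label follows the notebook; the rest of this snippet is an assumption about how the binning was meant to work.

```python
import pandas as pd

# Enrollment dates of the first five customers in the dataset preview
dates = pd.Series(
    ["2012-09-04", "2014-03-08", "2013-08-21", "2014-02-10", "2014-01-19"],
    name="Dt_Customer",
)

# Parse to datetime and keep only the calendar year
year = pd.to_datetime(dates).dt.year

# Use the year as the label; anything earlier than 2012 gets one shared bucket
date_customer = year.astype(str).where(year >= 2012, "Before 2012")

print(date_customer.value_counts())
```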


In [61]:
data.drop(columns="Dt_Customer", inplace=True)


In [63]:
data.head(100)

Unnamed: 0,ID,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,age_group,date_customer
0,5524,Graduation,Single,58138.0,0,0,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1,59-95 - Boomer,0
1,2174,Graduation,Single,46344.0,1,1,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0,59-95 - Boomer,0
2,4141,Graduation,Together,71613.0,0,0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0,43-58 - GenX,0
3,6182,Graduation,Together,26646.0,1,0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0,27-57 - Millenialls,0
4,5324,PhD,Married,58293.0,1,0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0,27-57 - Millenialls,2014
5,7446,Master,Together,62513.0,0,1,16,520,42,98,0,42,14,2,6,4,10,6,0,0,0,0,0,0,3,11,0,43-58 - GenX,0
6,965,Graduation,Divorced,55635.0,0,1,34,235,65,164,50,49,27,4,7,3,7,6,0,0,0,0,0,0,3,11,0,43-58 - GenX,0
7,6177,PhD,Married,33454.0,1,0,32,76,10,56,3,1,23,2,4,0,4,8,0,0,0,0,0,0,3,11,0,27-57 - Millenialls,0
8,4855,PhD,Together,30351.0,1,0,19,14,0,24,3,3,2,1,3,0,2,9,0,0,0,0,0,0,3,11,1,43-58 - GenX,0
9,5899,PhD,Together,5648.0,1,1,68,28,0,6,1,1,13,1,1,0,0,20,1,0,0,0,0,0,3,11,0,59-95 - Boomer,0



* `Education`,`Marital_Status`, `age_group` are of object type, we can change them to categories.

* `AcceptedCmp1`, `AcceptedCmp2`, `AcceptedCmp3`, `AcceptedCmp4`,`AcceptedCmp5`, `Complain`, `Response` are numerical but we can convert them to "category" as well.

* *Converting "objects" to "category" reduces the memory required to store the dataframe*

In [64]:
data["Education"] = data["Education"].astype("category")
data["Marital_Status"] = data["Marital_Status"].astype("category")
data["age_group"] = data["age_group"].astype("category")
data["AcceptedCmp1"] = data["AcceptedCmp1"].astype("category")
data["AcceptedCmp2"] = data["AcceptedCmp2"].astype("category")
data["AcceptedCmp3"] = data["AcceptedCmp3"].astype("category")
data["AcceptedCmp4"] = data["AcceptedCmp4"].astype("category")
data["AcceptedCmp5"] = data["AcceptedCmp5"].astype("category")
data["Complain"] = data["Complain"].astype("category")
data["Response"] = data["Response"].astype("category")
data["date_customer"] = data["date_customer"].astype("category")


In [65]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   ID                   2240 non-null   int64   
 1   Education            2240 non-null   category
 2   Marital_Status       2240 non-null   category
 3   Income               2216 non-null   float64 
 4   Kidhome              2240 non-null   int64   
 5   Teenhome             2240 non-null   int64   
 6   Recency              2240 non-null   int64   
 7   MntWines             2240 non-null   int64   
 8   MntFruits            2240 non-null   int64   
 9   MntMeatProducts      2240 non-null   int64   
 10  MntFishProducts      2240 non-null   int64   
 11  MntSweetProducts     2240 non-null   int64   
 12  MntGoldProds         2240 non-null   int64   
 13  NumDealsPurchases    2240 non-null   int64   
 14  NumWebPurchases      2240 non-null   int64   
 15  NumCatalogPurchases  


* *we can see that the memory usage has decreased from 507.6+KB to 390.9+KB*
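The effect can be reproduced in isolation. The sketch below is a toy example: the series `s` is made up, and only the three education labels come from the data. It compares the memory of an object column with its `category` equivalent using `memory_usage(deep=True)`.

```python
import pandas as pd

# Toy column: 3,000 strings drawn from three distinct labels
s = pd.Series(["Graduation", "PhD", "Master"] * 1000)

# deep=True counts the Python string objects, not just the pointers
bytes_as_object = s.memory_usage(deep=True)

# A category column stores small integer codes plus one copy of each label
bytes_as_category = s.astype("category").memory_usage(deep=True)

print(bytes_as_object, bytes_as_category)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.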

**Let's check for missing values in the data.**

In [66]:
# checking for missing values in the data
data.isnull().sum()

ID                      0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
age_group               0
date_customer           0
dtype: int64


* The `Income` column has 24 missing values, about 1% of its 2,240 rows.
* No other column has missing values.
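The share of missing values per column can be computed directly rather than by hand. The snippet below is a minimal sketch on a toy frame with made-up rows, followed by the arithmetic for the counts reported above (24 missing out of 2,240).

```python
import numpy as np
import pandas as pd

# Toy data: one of four Income values is missing
toy = pd.DataFrame(
    {
        "Income": [58138.0, np.nan, 71613.0, 26646.0],
        "Kidhome": [0, 1, 0, 1],
    }
)

# The mean of the boolean mask is the fraction of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Same calculation for the counts reported in the notebook
dataset_share = 24 / 2240 * 100
print(round(dataset_share, 2))
```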

In [67]:
# Let's look at the statistical summary of the data
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,2240.0,5592.16,3246.66,0.0,2828.25,5458.5,8427.75,11191.0
Income,2216.0,52247.25,25173.08,1730.0,35303.0,51381.5,68522.0,666666.0
Kidhome,2240.0,0.44,0.54,0.0,0.0,0.0,1.0,2.0
Teenhome,2240.0,0.51,0.54,0.0,0.0,0.0,1.0,2.0
Recency,2240.0,49.11,28.96,0.0,24.0,49.0,74.0,99.0
MntWines,2240.0,303.94,336.6,0.0,23.75,173.5,504.25,1493.0
MntFruits,2240.0,26.3,39.77,0.0,1.0,8.0,33.0,199.0
MntMeatProducts,2240.0,166.95,225.72,0.0,16.0,67.0,232.0,1725.0
MntFishProducts,2240.0,37.53,54.63,0.0,3.0,12.0,50.0,259.0
MntSweetProducts,2240.0,27.06,41.28,0.0,1.0,8.0,33.0,263.0



* We have 2,240 clients

* We can see that the `Response` is a binary of 1 and 0.  14% of our clients engaged in the campaign.

* 50% of our clients were born before 1970.

* 50% of our clients have an `Income` below about $51,400. However, `Income` ranges widely, from 1,730 to 666,666.

* Wines are the product category on which our clients spend the most, with a mean of about 304 per client.

* At least 50% of our clients have spent 173 or less on wines.

* The dollar amount each client spends on wine ranges widely, from 0 to 1,493.
    - This might be due to outliers in our data

* Meat is the second-largest product category by dollar amount, with a mean of about 167 per client. 50% of our clients have spent 67 or less on meat.

* At least 50% of our clients have 0 kids at home

* At least 50% of our clients have at most:

    * 6 web visits a month

    * 5 in-store purchases

    * 2 catalog purchases

    * 4 web purchases

    * 2 purchases made with a discount

    * 0 teens at home
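One common way to check whether the wide `Income` range reflects outliers is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles and maximum reported by `describe()` above; the rule is a swapped-in heuristic for illustration, not a step the notebook itself runs.

```python
# Quartiles and maximum of Income, copied from the describe() output
q1, q3, maximum = 35303.0, 68522.0, 666666.0

# Interquartile range and Tukey fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(lower_fence, upper_fence)

# True when the largest income lies beyond the upper fence
print(maximum > upper_fence)
```

Under this rule any income above the upper fence would be flagged for review; whether to cap, drop, or keep such rows is a separate modeling decision.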


**Let's look at the non-numeric columns.**

In [68]:
# filtering non-numeric columns
cat_columns = data.select_dtypes(exclude=np.number).columns
cat_columns

Index(['Education', 'Marital_Status', 'AcceptedCmp3', 'AcceptedCmp4',
       'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response',
       'age_group', 'date_customer'],
      dtype='object')


In [69]:
# printing the percentage share of each unique value in each categorical column
for column in cat_columns:

    print(data[column].value_counts() / len(data) * 100)
    print("-" * 50)

Graduation   50.31
PhD          21.70
Master       16.52
2n Cycle      9.06
Basic         2.41
Name: Education, dtype: float64
--------------------------------------------------
Married    38.57
Together   25.89
Single     21.43
Divorced   10.36
Widow       3.44
Alone       0.13
YOLO        0.09
Absurd      0.09
Name: Marital_Status, dtype: float64
--------------------------------------------------
0   92.72
1    7.28
Name: AcceptedCmp3, dtype: float64
--------------------------------------------------
0   92.54
1    7.46
Name: AcceptedCmp4, dtype: float64
--------------------------------------------------
0   92.72
1    7.28
Name: AcceptedCmp5, dtype: float64
--------------------------------------------------
0   93.57
1    6.43
Name: AcceptedCmp1, dtype: float64
--------------------------------------------------
0   98.66
1    1.34
Name: AcceptedCmp2, dtype: float64
--------------------------------------------------
0   99.06
1    0.94
Name: Complain, dtype: float64
-----------------


- Graduation, PhD, and Master together represent about 88.5% of our clients; Graduation and PhD alone account for about 72%.
- Top three marital status groups: 38.6% of our clients are married, 25.9% are together, and 21.4% are single.

### We will drop the missing values in the dataset.

In [70]:
data.dropna(inplace=True)
data.shape

(2216, 29)


## Let's visualize the data

### Univariate Analysis

In [71]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle marker will indicate the mean value of the column
    # histogram; pass bins only when the caller specifies them
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram


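### Distribution of numeric features

In [None]:
# select the numeric columns; the mean and median lines in histogram_boxplot need numeric data
num_columns = data.select_dtypes(include=np.number).columns

In [None]:
for column in num_columns:
    histogram_boxplot(data, column)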

### <a id='link1'>Summary of EDA</a>

**Data Description:**

- The target variable (`Response`) is of *int* type in the raw data and is converted to *category*.
- `Education`, `Marital_Status`, and `Dt_Customer` are of *object* type in the raw data.
- All other columns are numeric in nature.
- `Year_Birth` and `Dt_Customer` are replaced by the categorical columns `age_group` and `date_customer`.
- There are no duplicate values in the data.
- There are missing values in the `Income` column (24 rows). The rows with missing data have been dropped, leaving 2,216 rows.


**Observations from EDA:**

- `Income`: The median is about 51,400, and the values range from 1,730 to 666,666.
- `MntWines`: Wines have the highest mean spend (about 304), and at least 50% of clients have spent 173 or less.
- `MntMeatProducts`: Meat has the second-highest mean spend (about 167), with a median of 67.
- `Kidhome`, `Teenhome`: At least 50% of clients have no kids and no teens at home.
- `Education`: Graduation (50.3%), PhD (21.7%), and Master (16.5%) together cover about 88.5% of clients.
- `Marital_Status`: 38.6% of clients are married, 25.9% are together, and 21.4% are single.
- `AcceptedCmp1` to `AcceptedCmp5`: Each campaign was accepted by between 1.3% and 7.5% of clients.
- `Complain`: Less than 1% of clients complained in the last 2 years.

## Analysis

## Conclusions

## Recommendations