# 1. Problem 1: Evolution of Sales Volume

### <ins>General notes for the exam: </ins>

You can access all the data via your usual import package (e.g. pandas in Python):

In Python:
```
import pandas as pd
df_product = pd.read_csv('product_data.csv')
```

In R:
```
df_product <- read.csv('product_data.csv')
```

All needed packages are available in the test:
Non exhaustive list: pandas, numpy, sklearn, scipy, ...

Do not hesitate to <ins>***COMMENT***</ins> on your code and explain your ideas.

Try and answer all questions fully. If running out of time, please note that questions 6, 8 and 9 deliver the most points in the scoring system.

Throughout this entire exam, your goal will be to help a grocery company to better use its marketing campaigns. 

### <ins>How to debug your code: </ins>

To see the result of any print statements, you should:
1. Choose the tab "Custom input" next to the "Test Results" tab.
2. Fill in the text box that appears with: 'BCG'. You do not have to put your code in this box.


### <ins>Description of the data provided on Section 1: </ins>

You are provided with two datasets containing data about:
1. customer_data: containing data about customers
2. orders_data: containing data about orders

You can access them with the following snippet:
```
customers_df = pd.read_csv('customer_data.csv')
orders_df = pd.read_csv('order_data.csv')
```

The table `customer_data` contains the following columns:
- `customer_id`: id of the customer
- `birth_date`: birth date of the customer 
- `acquisition_channel`: marketing channel through which the customer was acquired.

The table `orders_data` contains the following columns:
- `order_id`: id of the order
- `transacion_date`: date of the transaction
- `product`: ordered product
- `price`: price of the bought product in $
- `quantity`: quantity ordered of the product
- `customer_id`: customer issuing the order

Before getting into modeling, we would like to do some preliminary analysis to better understand the data we have.


In [1]:
import pandas as pd
customers_df = pd.read_csv('customer_data.csv')
orders_df = pd.read_csv('order_data.csv')
orders_df.sample(frac = 0.2)

Unnamed: 0,transacion_date,product,price,quantity,orders_id,customer_id
290500,2016-08-04 15:07:00,IVORY SWEETHEART SOAP DISH,2.49,1,ba384e720c6940d29a257a4b040a9914,45e1ec9303114c7b80492895bf76e551
307877,2018-08-21 15:17:00,JUMBO BAG VINTAGE DOILY,2.08,2,85dea353abb04ab287bda4d8db4ef3e2,8118035675674e449fc044cd0a9fa243
419685,2018-10-26 12:28:00,3 GARDENIA MORRIS BOXED CANDLES,1.25,11,d55a5a5ab9194678a54a5b4bd8f69f73,9580760f2ca04f51866d91f903f9c124
461255,2018-11-13 10:35:00,VINTAGE CHRISTMAS PAPER GIFT BAG,0.82,10,1a303d3243974fb3a31f27c72ec59629,100bc232224d42a5b216d7006a803bba
110168,2017-03-06 11:07:00,EMERGENCY FIRST AID TIN,1.25,-1,5d48b13bf35a4470b9ecfed760edc206,aa21ec145f934cc5ad0033ae3b442b96
...,...,...,...,...,...,...
357921,2018-09-23 17:02:00,NO SINGING METAL SIGN,4.13,1,57126263ba8e47b4a255916919bf0be2,2d4f6bdc58d14e93b9bd84f790456e20
225091,2018-06-13 15:30:00,IVORY HANGING DECORATION HEART,1.63,1,2464bbbc2dd74afb82250557fce35201,930965fa8f0f4e079ddb1a7a757130d1
237234,2016-06-23 10:44:00,TUB 24 PINK FLOWER PEGS,1.65,1,4588fc2064ec45fdbb0aa19df1acdd78,8908a782358a48409935e355aea65005
154637,2018-04-13 11:04:00,WRAP DOILEY DESIGN,0.42,25,90e60bc2bb13459b83d0d50af016bda7,cf187e9d839b4064bd63e926869aa89c


In [2]:
customers_df.head()

Unnamed: 0,customer_id,birth_date,acquisition_channel
0,213803050f1a4336b286e6781d4a7073,1964/12/14-8:14:50,Web
1,77614bf0e41449d68586425edf550ef8,1964/10/23-19:32:14,Radio
2,c6c4fe164c09499db138fc9946b9d14b,1947/9/1-12:54:54,Billboard
3,29fd2cc211f941188d3254e4e9df378b,1987/7/24-3:11:23,Radio
4,db56add1ca8242cd90136337663cd6ec,1966/3/3-12:43:10,TV


In [3]:
orders_df.head()

Unnamed: 0,transacion_date,product,price,quantity,orders_id,customer_id
0,2016-12-01 08:26:00,WHITE HANGING HEART T-LIGHT HOLDER,2.55,6,487421d8e2cc41ccb62ef3719b46510e,213803050f1a4336b286e6781d4a7073
1,2016-12-01 08:26:00,WHITE METAL LANTERN,3.39,6,c33967c39e594e78a76d99d25d0d6ff9,77614bf0e41449d68586425edf550ef8
2,2016-12-01 08:26:00,CREAM CUPID HEARTS COAT HANGER,2.75,8,32ef4d6ee91d4d49b5393dc9ec007cc6,c6c4fe164c09499db138fc9946b9d14b
3,2016-12-01 08:26:00,KNITTED UNION FLAG HOT WATER BOTTLE,3.39,6,1f3ecb8a0aeb4eed93527a8f1f09a471,c6c4fe164c09499db138fc9946b9d14b
4,2016-12-01 08:26:00,RED WOOLLY HOTTIE WHITE HEART.,3.39,6,78ba1e6269394163aae507389075acdd,29fd2cc211f941188d3254e4e9df378b


### <ins>Problem 1:</ins>

What is the relative difference between the total sales in 2018 and 2017?
The relative difference is defined as: (sales_2018/sales_2017 - 1)
Sales are calculated as an amount in $.

In [4]:
big_table = pd.merge(orders_df, customers_df, how="left", on="customer_id")
big_table['amount'] = big_table['price']*big_table['quantity']
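# The raw strings look like '2016-12-01 08:26:00', so the format uses dashes and includes seconds.
big_table['transacion_date'] = pd.to_datetime(big_table['transacion_date'], format='%Y-%m-%d %H:%M:%S')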
big_table['year'] = big_table['transacion_date'].dt.year
#big_table.head()
sales_by_year = big_table.groupby('year')['amount'].sum()
sales_2017 = sales_by_year.loc[2017]
sales_2018 = sales_by_year.loc[2018]
print("Sales in 2017: {}".format(sales_2017))
print("Sales in 2018: {}".format(sales_2018))
print("The relative difference between 2018 and 2017 is {}".format((sales_2018/sales_2017)-1))

Sales in 2017: 3091888.722
Sales in 2018: 3397950.332
The relative difference between 2018 and 2017 is 0.09898855926549088


# 2. Problem 2: Population type per acquisition channel

### <ins>Problem 2:</ins>

What is the median age per acquisition channel?
Please return a pandas dataframe containing the following columns:
- acquisition_channel: contains the channel name (radio, tv, ...)
- median_age: median age of the given channel

N.B: The median should be calculated on an integer 'age': convert the age to an integer before calculating the median.


In [5]:
big_table = pd.merge(orders_df, customers_df, how="left", on="customer_id")
big_table['birth_date'] = pd.to_datetime(big_table['birth_date'], format='%Y/%m/%d-%H:%M:%S', errors='coerce')
big_table.dropna(inplace=True)
big_table['birth_year'] = big_table['birth_date'].dt.year
big_table['age'] = 2022 - big_table["birth_year"]
big_table["age"].head()

big_table.groupby('acquisition_channel')['age'].agg("mean")

acquisition_channel
Billboard    49.460909
Radio        49.619083
TV           49.607502
Web          49.540526
Name: age, dtype: float64
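
The cell above reports the mean age computed over order lines, whereas the question asks for the median age per channel returned as a dataframe. A minimal sketch of the requested output is given below. It assumes the `birth_date` format shown in `customers_df.head()` and the same reference year (2022) as the cell above; `median_age_per_channel` and `toy_customers` are hypothetical names, and the rows are made-up toy data standing in for `customers_df`.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # assumption: same reference year as the cell above


def median_age_per_channel(customers: pd.DataFrame) -> pd.DataFrame:
    """Return one row per acquisition channel with its median integer age."""
    birth = pd.to_datetime(
        customers["birth_date"], format="%Y/%m/%d-%H:%M:%S", errors="coerce"
    )
    is_valid = birth.notna()  # rows whose birth date could be parsed
    valid = customers.loc[is_valid, ["acquisition_channel"]].copy()
    # Integer age, computed once per customer (not once per order line).
    valid["age"] = (REFERENCE_YEAR - birth[is_valid].dt.year).astype(int)
    return (
        valid.groupby("acquisition_channel", as_index=False)["age"]
        .median()
        .rename(columns={"age": "median_age"})
    )


# Toy rows in the same format as customer_data.csv (values are illustrative).
toy_customers = pd.DataFrame(
    {
        "customer_id": ["a", "b", "c", "d"],
        "birth_date": [
            "1964/12/14-8:14:50",
            "1987/7/24-3:11:23",
            "1947/9/1-12:54:54",
            "not a date",
        ],
        "acquisition_channel": ["Web", "Radio", "Radio", "TV"],
    }
)
median_ages = median_age_per_channel(toy_customers)
print(median_ages)
```

On the real data, the same function would be called on `customers_df` directly, so that each customer counts once regardless of how many orders they placed.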

In [6]:
res = big_table.where(big_table.quantity <= 0)[['product', 'quantity']]
res.dropna(inplace=True)
res.head()

Unnamed: 0,product,quantity
141,Discount,-1.0
154,SET OF 3 COLOURED FLYING DUCKS,-1.0
235,PLASTERS IN TIN CIRCUS PARADE,-12.0
236,PACK OF 12 PINK PAISLEY TISSUES,-24.0
237,PACK OF 12 BLUE PAISLEY TISSUES,-24.0


# 3. Problem 3: Popular product within millennials

### <ins>Problem 3:</ins>

What is the most popular product (in terms of number of sold units) among the millennials (born between 1981 and 1996 incl.)?

In [7]:
big_table.dropna(inplace=True)
t = big_table.where((big_table.birth_year>=1981) & (big_table.birth_year<=1996))[["product", "quantity"]]
t.dropna(inplace=True)
# t.head(20)
t.groupby('product').agg("sum").sort_values(by=['quantity'], ascending=False).head(5)

Unnamed: 0_level_0,quantity
product,Unnamed: 1_level_1
WORLD WAR 2 GLIDERS ASSTD DESIGNS,14506.0
JUMBO BAG RED RETROSPOT,11714.0
PACK OF 72 RETROSPOT CAKE CASES,8814.0
WHITE HANGING HEART T-LIGHT HOLDER,8210.0
ASSORTED COLOUR BIRD ORNAMENT,7701.0


# 4. Linear regression (1/2)

## Section 2: Linear Regression 

We would like to understand which marketing channel is the most effective.
For that, the marketing department of our client provided us with a dataset containing the weekly spend on each channel and the revenue generated that week over the last 3 years.

The data is in a tabular format with the following columns:
- week: a week identifier
- spends_tv: spendings on tv marketing campaign that week
- spends_radio: spendings on ads on the radio
- spends_web: spendings on web ads
- spends_billboard: spendings on physical ads on billboards
- revenue: revenue generated during this week

Your colleague had the idea of treating this as a regression problem in which he tries to estimate the revenue as a linear function of the spendings.
He has performed a linear regression and obtained the following results: 



OLS Results
================================================================

| Variable    | Value       |
| ----------- | ----------- |
| Dep. Variable      | revenue       |
| Model   | OLS        |
| Method   | Least Squares        |
| Date   | Mon, 14 May 2018        |
| Time   | 21:48:12        |
| No. Observations   | 156        |
| Df Model   | 3        |
| Covariance type   | nonrobust        |
| R-squared   | 0.816        |
| Adj. R-squared   | 0.712        |
| F-statistic   | 6.646        |
| Prob(F-statistic)   | 0.00157        |
| Log-Likelihood   | -12.974        |


================================================================


| |coef| std err | t | P>\|t\| |
|--- | ---| --- | --- | --- |
|spends_tv | 10454.7| 197.2 | 53.02 | <0.00001 |
|spends_radio | 5984.2| 959.3 | 6.238 | 0.0041543 |
|spends_web | 8324.1| 134.5 | 61.89 | <0.00001 |
|spends_billboard | 6278.5| 434.1 | 14.46 | <0.000359 |
|const | 30332.2| 202.1 | 150.1 | <0.00001 |

================================================================


| Variable    | Value       |
| ----------- | ----------- |
| Omnibus      | 0.176       |
| Prob(Omnibus)   | 0.916        |
| Skew   | 0.141        |
| Kurtosis   | 2.786        |
| Durbin-Watson   | 2.346        |
| Jarque-Bera (JB)   | 0.167        |
| Prob(JB)   | 0.920        |
| Cond. No.   | 176.        |


================================================================

Warnings:
[1] Standard errors assume that the covariance matrix of the errors is correctly specified.


### <ins>Problem 4:</ins>
<ins>Question 1:</ins>

With certainty, can you provide the MOST effective marketing channel?


Yes, it is TV.
To arrive at this result, we define the effectiveness of a channel as the net revenue generated per unit of money spent on that channel.
The R-squared (0.816) is high, which indicates a good fit, so we can read the effectiveness from the coefficients:
$$\frac{\text{Net revenue generated by channel}}{\text{Expenses on this channel}}$$
$$= \frac{\text{Gross revenue generated by channel} - \text{Expenses on this channel}}{\text{Expenses on this channel}}$$
$$= \frac{(\text{Expenses on this channel} \times \text{coef of OLS}) - \text{Expenses on this channel}}{\text{Expenses on this channel}}$$
$$= \text{coef of OLS} - 1$$
So the variable with the largest coefficient is the most effective: in this case, TV.
N.B.: spends_tv has a very low p-value (<0.00001) and a small standard error (197.2), which indicates that this coefficient is reliably estimated.

### <ins>Problem 5:</ins>
<ins>Question 2:</ins>

With certainty, can you provide the LEAST effective marketing channel?
Pick **ONE** option
- TV
- Radio 
- Web
- Billboard
- Can't say 


No, it could be either radio or billboard.
Following the reasoning of the previous question, radio has the smallest coefficient (5984.2) and would be the least effective channel. However, its standard error is high (959.3). The billboard coefficient (6278.5) is close and also has a sizeable standard error (434.1), so in practice billboard may be the least effective channel. With this model we cannot determine the least effective channel with certainty.
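
The argument in Problems 4 and 5 can be checked numerically with approximate 95% confidence intervals built from the `coef` and standard-error columns of the OLS table. This is a rough sketch under stated assumptions: it uses the normal quantile 1.96 instead of the exact t quantile, and it compares intervals for overlap, which is a heuristic rather than a formal test of the difference between two coefficients (that would also need the covariance between the estimates, which the table does not give). The helper names are hypothetical.

```python
# Coefficients and standard errors copied from the OLS table above.
ols_table = {
    "spends_tv": (10454.7, 197.2),
    "spends_radio": (5984.2, 959.3),
    "spends_web": (8324.1, 134.5),
    "spends_billboard": (6278.5, 434.1),
}
Z_95 = 1.96  # assumption: normal approximation to the t quantile


def confidence_interval(coef, std_err, z=Z_95):
    """Approximate 95% interval: coef +/- z * std_err."""
    return coef - z * std_err, coef + z * std_err


def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {name: confidence_interval(*vals) for name, vals in ols_table.items()}

for name, (low, high) in intervals.items():
    print(f"{name:17s} [{low:9.1f}, {high:9.1f}]")

# TV: is its interval disjoint from every other channel's interval?
tv_is_separated = all(
    not overlaps(intervals["spends_tv"], ci)
    for name, ci in intervals.items()
    if name != "spends_tv"
)
# Radio vs billboard: do the two lowest coefficients have overlapping intervals?
radio_vs_billboard_overlap = overlaps(
    intervals["spends_radio"], intervals["spends_billboard"]
)
print("TV interval separated from all others:", tv_is_separated)
print("Radio and billboard intervals overlap:", radio_vs_billboard_overlap)
```

Under these assumptions the TV interval sits above all the others, while the radio and billboard intervals overlap, which is consistent with answering "TV" for the most effective channel and "Can't say" for the least effective one.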

# 6. Regression model to predict revenues

## Section 2: Build a regression model

The marketing department with which we are working wants to send personalised promotions to targeted customers. They need our help to identify the highest-value customers: the customers who will generate the most revenue.

For this, you are asked to create a **regression model** that predicts the demand for a given customer and year. The **target value** is **revenue**, which is equal to **price \* quantity** <ins>**aggregated at year level**</ins>.

Important Note in this section:

Do not hesitate to write comments and to modularize your code.
You will be evaluated both on the result and the quality of your code.
If you get stuck on a question, do not hesitate to move on to the next question.
Points are assigned independently for each question.



**Question 1:**

Using the two tables from Question 1 (orders_data and customer_data), create an aggregated table of revenue by year and customer_id.



In [37]:
big_table = pd.merge(orders_df, customers_df, how="left", on="customer_id")
big_table['revenus'] = big_table['price']*big_table['quantity']
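# The raw strings look like '2016-12-01 08:26:00', so the format uses dashes and includes seconds.
big_table['transacion_date'] = pd.to_datetime(big_table['transacion_date'], format='%Y-%m-%d %H:%M:%S')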
big_table['year'] = big_table['transacion_date'].dt.year
big_table['month'] = big_table['transacion_date'].dt.month
big_table.head()

Unnamed: 0,transacion_date,product,price,quantity,orders_id,customer_id,birth_date,acquisition_channel,revenus,year,month
0,2016-12-01 08:26:00,WHITE HANGING HEART T-LIGHT HOLDER,2.55,6,487421d8e2cc41ccb62ef3719b46510e,213803050f1a4336b286e6781d4a7073,1964/12/14-8:14:50,Web,15.3,2016,12
1,2016-12-01 08:26:00,WHITE METAL LANTERN,3.39,6,c33967c39e594e78a76d99d25d0d6ff9,77614bf0e41449d68586425edf550ef8,1964/10/23-19:32:14,Radio,20.34,2016,12
2,2016-12-01 08:26:00,CREAM CUPID HEARTS COAT HANGER,2.75,8,32ef4d6ee91d4d49b5393dc9ec007cc6,c6c4fe164c09499db138fc9946b9d14b,1947/9/1-12:54:54,Billboard,22.0,2016,12
3,2016-12-01 08:26:00,KNITTED UNION FLAG HOT WATER BOTTLE,3.39,6,1f3ecb8a0aeb4eed93527a8f1f09a471,c6c4fe164c09499db138fc9946b9d14b,1947/9/1-12:54:54,Billboard,20.34,2016,12
4,2016-12-01 08:26:00,RED WOOLLY HOTTIE WHITE HEART.,3.39,6,78ba1e6269394163aae507389075acdd,29fd2cc211f941188d3254e4e9df378b,1987/7/24-3:11:23,Radio,20.34,2016,12


In [39]:
filtered_table = big_table[['year', 'customer_id', 'birth_date', 'price', 'quantity', 'acquisition_channel', 'revenus']]

In [44]:
# Aggregate the order lines to one row per (year, customer).
# Columns are selected with a list: tuple-style selection on a groupby is rejected by recent pandas.
agg_revenues = (
    filtered_table
    .groupby(['year', 'customer_id'])
    [['birth_date', 'price', 'quantity', 'acquisition_channel', 'revenus']]
    .agg(
        dict(
            birth_date='first',
            price='mean',
            quantity='sum',
            acquisition_channel='first',
            revenus='sum'
        )
    )
    .reset_index()
)
agg_revenues = agg_revenues.rename(columns={"price": "mean_price", "year": "year_transaction"})

agg_revenues.head()



Unnamed: 0,year_transaction,customer_id,birth_date,mean_price,quantity,acquisition_channel,revenus
0,2016,00031990d7154c5d890d2dababf5225c,1974/8/8-9:10:18,2.53,28,TV,48.1
1,2016,0003ce92d18d4fd995546c24728b44ff,1988/11/30-21:40:13,1.65,6,Web,9.9
2,2016,0003fa288daa4e909add27ef3c219a27,1960/2/3-3:12:7,1.5575,80,TV,32.33
3,2016,000616d6d3fd435a9ca693f2f2d064fd,1995/2/2-7:45:11,3.2,9,Web,41.05
4,2016,0006f73efa16466fa4cfac16dc26dcb9,1947/1/8-9:42:53,11.716667,27,TV,80.85


**Question 2:**
Create the following features on the aggregated table from question 1:
- age: customer age
- prev_year_revenue: revenue generated by the customer on the previous year
- prev_year_nb_products: number of distinct products bought by the customer on the previous year


In [48]:
# Age
CURRENT_YEAR = 2022
agg_revenues['birth_date'] = pd.to_datetime(agg_revenues['birth_date'], format='%Y/%m/%d-%H:%M:%S', errors='coerce')
agg_revenues.dropna(inplace=True)
agg_revenues['birth_year'] = agg_revenues['birth_date'].dt.year
agg_revenues['age'] = CURRENT_YEAR - agg_revenues["birth_year"]
agg_revenues.head()

Unnamed: 0,year_transaction,customer_id,birth_date,mean_price,quantity,acquisition_channel,revenus,birth_year,age,prev_year_revenue
52732,2017,00031990d7154c5d890d2dababf5225c,1974-08-08 09:10:18,3.85,9,TV,35.35,1974,48,48.1
52733,2017,0003ce92d18d4fd995546c24728b44ff,1988-11-30 21:40:13,2.045,26,Web,49.06,1988,34,9.9
52734,2017,0003fa288daa4e909add27ef3c219a27,1960-02-03 03:12:07,3.98,16,TV,-8.56,1960,62,32.33
52735,2017,000616d6d3fd435a9ca693f2f2d064fd,1995-02-02 07:45:11,4.12,14,Web,28.31,1995,27,41.05
52736,2017,0006f73efa16466fa4cfac16dc26dcb9,1947-01-08 09:42:53,3.105,20,TV,54.68,1947,75,80.85


In [46]:
# previous year revenue
agg_revenues_index = agg_revenues.set_index(['year_transaction', 'customer_id'])
def get_revenue_year_before(row):
    """Return the customer's revenue for the previous year, or None if there is none."""
    try:
        revenue_year_before = agg_revenues_index.loc[(row['year_transaction'] - 1, row['customer_id']), 'revenus']
    except KeyError:
        revenue_year_before = None

    return revenue_year_before

# Row-wise lookup: simple, but slow on large tables.
agg_revenues['prev_year_revenue'] = agg_revenues.apply(get_revenue_year_before, axis=1)

In [47]:
agg_revenues_reduced = agg_revenues.dropna()
agg_revenues_reduced.head()

Unnamed: 0,year_transaction,customer_id,birth_date,mean_price,quantity,acquisition_channel,revenus,birth_year,age,prev_year_revenue
52732,2017,00031990d7154c5d890d2dababf5225c,1974-08-08 09:10:18,3.85,9,TV,35.35,1974,48,48.1
52733,2017,0003ce92d18d4fd995546c24728b44ff,1988-11-30 21:40:13,2.045,26,Web,49.06,1988,34,9.9
52734,2017,0003fa288daa4e909add27ef3c219a27,1960-02-03 03:12:07,3.98,16,TV,-8.56,1960,62,32.33
52735,2017,000616d6d3fd435a9ca693f2f2d064fd,1995-02-02 07:45:11,4.12,14,Web,28.31,1995,27,41.05
52736,2017,0006f73efa16466fa4cfac16dc26dcb9,1947-01-08 09:42:53,3.105,20,TV,54.68,1947,75,80.85
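
Question 2 also asks for `prev_year_nb_products`, which the cells above do not compute. Both previous-year features can be built without a row-wise `apply` by shifting the year and merging the table with itself. The sketch below is one possible approach, not the notebook's method: it runs on made-up toy order lines, uses a plain `year` column (the notebook renames it to `year_transaction`), and `nb_products` is a hypothetical intermediate column name.

```python
import pandas as pd

# Toy order lines with the same column names as orders_df (values are illustrative).
toy_orders = pd.DataFrame(
    {
        "customer_id": ["a", "a", "a", "b", "b"],
        "year": [2016, 2016, 2017, 2017, 2018],
        "product": ["SOAP DISH", "LANTERN", "SOAP DISH", "LANTERN", "LANTERN"],
        "price": [2.0, 3.0, 2.0, 3.0, 3.0],
        "quantity": [1, 2, 5, 1, 4],
    }
)
toy_orders["revenus"] = toy_orders["price"] * toy_orders["quantity"]

# One row per (customer, year): yearly revenue and number of distinct products.
yearly = toy_orders.groupby(["customer_id", "year"], as_index=False).agg(
    revenus=("revenus", "sum"),
    nb_products=("product", "nunique"),
)

# Shift the year forward by one so that year Y-1 lines up with year Y in the merge.
previous = yearly.rename(
    columns={"revenus": "prev_year_revenue", "nb_products": "prev_year_nb_products"}
)
previous["year"] = previous["year"] + 1

# Left merge: customers with no purchase in the previous year get NaN.
features = yearly.merge(previous, on=["customer_id", "year"], how="left")
print(features)
```

Rows without a previous year keep `NaN` in both new columns, so they can be dropped or imputed afterwards, as the `dropna()` in the next cell does.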


**Question 3:**
Can you add some other features that may help the model?
We expect you to add at least 2 features.


In [12]:
# Total number of products bought
# Number of months purchased by year
# Total number of transactions

In [26]:
# Work on an explicit copy so that the assignment below does not trigger SettingWithCopyWarning.
agg_revenues_reduced = agg_revenues_reduced.copy()
# Note: this counts the rows of the aggregated table per customer (the number of years
# in which the customer appears), not the individual order lines.
agg_revenues_reduced['number_of_transactions'] = \
    agg_revenues_reduced.groupby('customer_id')['customer_id'].transform('count')
agg_revenues_reduced


Unnamed: 0,year,customer_id,birth_date,revenus,birth_year,age,prev_year_revenue,number_of_transactions
52732,2017,00031990d7154c5d890d2dababf5225c,1974-08-08 09:10:18,35.35,1974,48,48.10,2
52733,2017,0003ce92d18d4fd995546c24728b44ff,1988-11-30 21:40:13,49.06,1988,34,9.90,2
52734,2017,0003fa288daa4e909add27ef3c219a27,1960-02-03 03:12:07,-8.56,1960,62,32.33,2
52735,2017,000616d6d3fd435a9ca693f2f2d064fd,1995-02-02 07:45:11,28.31,1995,27,41.05,2
52736,2017,0006f73efa16466fa4cfac16dc26dcb9,1947-01-08 09:42:53,54.68,1947,75,80.85,2
...,...,...,...,...,...,...,...,...
158320,2018,fff7ec09d884403a806344f3483856f2,1977-05-18 11:55:35,31.66,1977,45,141.52,2
158321,2018,fffb8e46eb8f4f00b235b77e015cb26f,1993-07-12 08:36:15,14.09,1993,29,24.93,2
158322,2018,fffbb76fb452467ca5b396b41315473b,1991-04-06 10:11:57,47.49,1991,31,47.40,2
158323,2018,fffcb2bb9d784be6a0d010e509226deb,1995-02-28 12:46:19,33.01,1995,27,225.64,2
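
The three candidate features listed in the comments of cell `In [12]` are easier to compute from the order lines than from the table already aggregated by year. The sketch below shows one way to do it on made-up toy data; the output column names (`number_of_transactions`, `active_months`, `total_quantity`) are illustrative choices, and each row of the toy table is treated as one transaction, which is an assumption about how `orders_id` is used.

```python
import pandas as pd

# Toy order lines (illustrative values); column names follow orders_df.
toy_orders = pd.DataFrame(
    {
        "orders_id": ["o1", "o2", "o3", "o4"],
        "customer_id": ["a", "a", "a", "b"],
        "transacion_date": pd.to_datetime(
            [
                "2017-01-05 10:00:00",
                "2017-01-20 11:30:00",
                "2017-03-02 09:15:00",
                "2017-06-11 16:45:00",
            ]
        ),
        "quantity": [6, 2, 1, 10],
    }
)
toy_orders["year"] = toy_orders["transacion_date"].dt.year
toy_orders["month"] = toy_orders["transacion_date"].dt.month

extra_features = toy_orders.groupby(["year", "customer_id"], as_index=False).agg(
    number_of_transactions=("orders_id", "nunique"),  # distinct order ids
    active_months=("month", "nunique"),  # months with at least one purchase
    total_quantity=("quantity", "sum"),  # units bought over the year
)
print(extra_features)
```

Because the target is the revenue of the same year, features like these would normally be shifted by one year (as done for `prev_year_revenue`) before training, otherwise they leak information about the target.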


**Question 4:**
We will split the data into training and test sets. The test set should correspond to orders in the year 2018, and the other years form the training set.
Train your regression model on the training set and then use it to predict the outcome on the test set.
Calculate the RMSE (Root Mean Square Error) on the test set and return it.

Models:
- Linear regression
- Polynomial regression
- Tree
- Elastic Net (strongly correlated data)
- Ensemble methods:
    - Random forest
    - Extra trees
    - XGBoost
    - LightGBM
    - Bagging
    - Voting
- Wavelets
- Neural Networks

In [179]:
agg_revenues_reduced.head()

Unnamed: 0,year,customer_id,birth_date,revenus,birth_year,age,prev_year_revenue
52732,2017,00031990d7154c5d890d2dababf5225c,1974-08-08 09:10:18,35.35,1974,48,48.1
52733,2017,0003ce92d18d4fd995546c24728b44ff,1988-11-30 21:40:13,49.06,1988,34,9.9
52734,2017,0003fa288daa4e909add27ef3c219a27,1960-02-03 03:12:07,-8.56,1960,62,32.33
52735,2017,000616d6d3fd435a9ca693f2f2d064fd,1995-02-02 07:45:11,28.31,1995,27,41.05
52736,2017,0006f73efa16466fa4cfac16dc26dcb9,1947-01-08 09:42:53,54.68,1947,75,80.85


In [180]:
dataset = agg_revenues_reduced
dataset_train = agg_revenues_reduced.loc[agg_revenues_reduced['year'] != 2018]
dataset_test = agg_revenues_reduced.loc[agg_revenues_reduced['year'] == 2018]

X_train = dataset_train.drop(["revenus"], axis=1)
y_train = dataset_train['revenus']
X_test = dataset_test.drop(["revenus"], axis=1)
y_test = dataset_test['revenus']
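
The cell above stops at the split, and `X_train` still contains non-numeric columns (`customer_id`, `birth_date`) that a regression model cannot use directly. A minimal sketch of the remaining steps (fit, predict, RMSE) is shown below. It uses a plain linear regression on two numeric features and randomly generated data standing in for `agg_revenues_reduced`, so the printed RMSE says nothing about the real dataset; the `FEATURES` list is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for agg_revenues_reduced (random values, for illustration only).
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame(
    {
        "year": rng.choice([2017, 2018], size=n_rows),
        "age": rng.integers(18, 80, size=n_rows),
        "prev_year_revenue": rng.gamma(2.0, 25.0, size=n_rows),
    }
)
toy["revenus"] = 0.8 * toy["prev_year_revenue"] + rng.normal(0.0, 5.0, size=n_rows)

FEATURES = ["age", "prev_year_revenue"]  # numeric columns only; ids and dates are left out
TARGET = "revenus"

# Time-based split: 2018 is the test set, every other year is the training set.
train = toy[toy["year"] != 2018]
test = toy[toy["year"] == 2018]

model = LinearRegression().fit(train[FEATURES], train[TARGET])
predictions = model.predict(test[FEATURES])

# RMSE is the square root of the mean squared error.
rmse = float(np.sqrt(mean_squared_error(test[TARGET], predictions)))
print(f"Test RMSE: {rmse:.2f}")
```

Any of the models listed above could replace `LinearRegression` here, since they share the same `fit` / `predict` interface in scikit-learn.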

# 7. Data Assessment

## Section 3: Build a regression model



**Question:**

In the provided data, we only have the quantity of sold products. What other important data is missing to estimate the real demand?


# Ask here

# 8. 

## Section 4: Model interpretation

We would like now to focus on a particular and rare product: 'Chia seeds'.

This product represents 2% of the sales volume (in terms of quantity).

Your colleague has built a classification model to predict if a given customer will buy this product.
The output of the algorithm is binary:
- 1 when the model predicts that the customer will buy 'Chia seeds'
- 0 when the model predicts that the customer will **NOT** buy 'Chia seeds'

The model your colleague made has a good accuracy:
- For Chia seeds buyers, the model is correct 98% of the time.
- For **NON** Chia seeds buyers, the model is correct 98% of the time.

You have run the model on a customer from your database, and the model predicted a positive answer, meaning that this customer will buy chia seeds.


**Question:**

What is the probability that this customer is a chia seeds buyer?

Pick **ONE** option:
- 99%
- 90%
- 80%
- 70%
- **50%** <--------
- 40%
- 30%
- 20%
- 10%
- 1%




---

Model: 

Random variable X : 

    - 0: Consumer will not buy Chia.
    - 1: Consumer will buy Chia.

Random variable Y :

    - 0: Model predicts that the customer will not buy Chia.
    - 1: Model predicts that the customer will buy Chia.

---

Data from problem:
$$P(X=0)=0.98$$
$$P(X=1)=0.02$$
$$P(Y=1|X=1)=0.98$$
$$\Rightarrow P(Y=0|X=1)=0.02$$
$$P(Y=0|X=0)=0.98$$
$$\Rightarrow P(Y=1|X=0)=0.02$$

---

Question: $$P(X=1|Y=1)=?$$

---
Solution:
$$P(X=1|Y=1)$$
$$=\dfrac{P(X=1,Y=1)}{P(Y=1)}$$
$$=\dfrac{P(Y=1,X=1)}{\sum_i P(Y=1,X=i)}$$
$$=\frac{P(Y=1,X=1)} {P(Y=1,X=0) + P(Y=1,X=1)}$$
$$=\dfrac{1}{1 + \dfrac{P(Y=1,X=0)}{P(Y=1,X=1)}}$$
$$=\dfrac{1}{1 + \dfrac{P(Y=1|X=0)P(X=0)}{P(Y=1|X=1)P(X=1)}}$$
$$=\dfrac{1}{1 + \dfrac{0.02*0.98}{0.98*0.02}}$$
$$=\dfrac{1}{2}$$
$$=0.5$$
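
The derivation above can be checked numerically. The short sketch below applies Bayes' rule with the same inputs; like the derivation, it assumes that the 2% share of sales volume can be used as the prior probability that a customer is a buyer. The function name `posterior_buyer` is hypothetical.

```python
def posterior_buyer(prior, sensitivity, specificity):
    """P(X=1 | Y=1) from Bayes' rule.

    prior       = P(X=1)
    sensitivity = P(Y=1 | X=1)
    specificity = P(Y=0 | X=0)
    """
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


p = posterior_buyer(prior=0.02, sensitivity=0.98, specificity=0.98)
print(round(p, 4))
```

The two terms in the denominator are equal (0.98 × 0.02 in both cases), which is why a rare product combined with a 2% false-positive rate brings the posterior down to one half.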


# 9. 

## Section 5: Preprocessing Step

Imagine you have to develop a regression model.
After gathering all the data you need in one table and building your features, you end up with a table having 7000 observations and 8000 features.

What is your next step?
Can you provide 3 different techniques to do it?
Can you also explain the main differences between them?

Please provide your answer in the following editor.



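Dimensionality reduction (there are more features than observations):

    - PCA
        Unsupervised: projects the data onto the directions of maximum variance (largest eigenvalues of the covariance matrix)

    - LDA
        Supervised: maximizes the separation between classes (between-class variance relative to within-class variance); it needs class labels, so it applies to classification rather than regression

Regression:

    - Supervised principal components
        Like PCA, but the components are computed only from the features most associated with the target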
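
To make the difference between PCA and supervised principal components concrete, the sketch below fits both on a small synthetic problem with more features than observations (70 × 80, standing in for the 7000 × 8000 table). It is an illustration under assumptions: the data is random, the numbers of components and of screened features are arbitrary, and supervised principal components is approximated by a univariate screening step (`SelectKBest` with `f_regression`) followed by PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 70 observations, 80 features, only 3 of which carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(70, 80))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=70)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Principal component regression: the components ignore the target.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())

# Supervised variant: keep the 10 features most associated with the target,
# then compute the principal components on those features only.
supervised_pcr = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=10),
    PCA(n_components=5),
    LinearRegression(),
)

for name, pipeline in [("PCA", pcr), ("Supervised PCA", supervised_pcr)]:
    pipeline.fit(X_train, y_train)
    print(f"{name}: test R^2 = {pipeline.score(X_test, y_test):.3f}")
```

Putting the screening step inside the pipeline matters: it is refitted on the training data only, so the test set does not influence which features are kept.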
