# Coding Sessions


## 1. Probability Case
**Context:** You have a box containing 10 six-sided dice.
*   **Defective Die (1):** One die is defective with the following probabilities for its faces:
    *   Side 1: 10%
    *   Side 2: 10%
    *   Side 3: 10%
    *   Side 4: 20%
    *   Side 5: 20%
    *   Side 6: 30%
*   **Normal Dice (9):** The other nine dice are fair, with the probability of rolling any side being 1/6.

**Task:** You randomly select **two** dice from the box and roll them. Calculate the **expected sum** of the values you will roll. Round your answer to the nearest thousandth.


In [3]:
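# Expected value of the defective die: sum of face * probability
defective_dice_expectation = 0.1 * (1 + 2 + 3) + 0.2 * (4 + 5) + 0.3 * 6
print(defective_dice_expectation)
# Expected value of a fair six-sided die
normal_dice_expectations = 3.5
# Linearity of expectation: each drawn die is the defective one with probability 1/10
final_expectations = 2 * (0.1 * defective_dice_expectation + 0.9 * normal_dice_expectations)
print(final_expectations)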

4.2
7.14
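
As a cross-check on the linearity argument, the following sketch enumerates all 45 unordered pairs of dice and averages their expected sums using exact fractions. The names `box`, `expected_value`, and `expected_sum` are illustrative and do not appear in the original cell.

```python
from fractions import Fraction
from itertools import combinations

# Face probabilities of the defective die, as given in the problem statement
defective = {
    1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
    4: Fraction(2, 10), 5: Fraction(2, 10), 6: Fraction(3, 10),
}
fair = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(die):
    """Expected face value of a die given as {face: probability}."""
    return sum(face * prob for face, prob in die.items())

# The box holds one defective die and nine fair dice
box = [defective] + [fair] * 9

# Every unordered pair of dice is equally likely to be drawn
pairs = list(combinations(range(len(box)), 2))
expected_sum = sum(
    expected_value(box[i]) + expected_value(box[j]) for i, j in pairs
) / len(pairs)

print(expected_sum, float(expected_sum))
```

Enumerating pairs treats the draw as sampling without replacement, so agreement with the shortcut formula suggests the factor of 2 in the cell above is justified.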


## 2. Distribution Problem
> A scientist is carrying out a series of experiments. Each experiment can end with either a success or a failure. The probability of success is **p = 0.820**, and the probability of failure is **q = 0.180**. Experiments in a series are independent of one another.
>
> If an experiment ends with a success, the detector registers its results correctly with a probability of **pr = 0.960**. If the experiment ends with a failure, nothing is registered.
>
> The scientist is going to run a series of **20 experiments**. Calculate the probability of getting **exactly 16** experiment results registered correctly on the detector. Round your answer to the nearest thousandth (three decimal places).


In [5]:
from math import comb

p_success = 0.820
q_failure = 0.180
pr_registered = 0.960
n_experiments = 20
# A result is registered only if the experiment succeeds AND the detector records it
p_registered_success = p_success * pr_registered

# Binomial PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
def probability_of_success(k, n, p):
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

k_success = 16
probability_of_success_16 = probability_of_success(k_success, n_experiments, p_registered_success)
print(round(probability_of_success_16, 3))

0.216
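
The calculation above relies on the number of registered results being binomial with per-experiment probability `p * pr`. A rough way to sanity-check that assumption is a small Monte Carlo simulation of the two-stage process (success, then registration). This is only a sketch: the seed, the trial count, and the names `rng`, `n_trials`, and `estimate` are arbitrary choices, not part of the original notebook.

```python
import random
from math import comb

p_success, pr_registered = 0.820, 0.960
n_experiments, k_registered = 20, 16

# Closed-form answer using the combined probability p * pr
p_combined = p_success * pr_registered
exact = (
    comb(n_experiments, k_registered)
    * p_combined ** k_registered
    * (1 - p_combined) ** (n_experiments - k_registered)
)

# Simulate each experiment as two independent Bernoulli stages
rng = random.Random(0)
n_trials = 50_000
hits = 0
for _trial in range(n_trials):
    registered = 0
    for _experiment in range(n_experiments):
        succeeded = rng.random() < p_success
        if succeeded and rng.random() < pr_registered:
            registered += 1
    if registered == k_registered:
        hits += 1

estimate = hits / n_trials
print(round(exact, 3), round(estimate, 3))
```

The simulated frequency should land close to the closed-form value, with some sampling noise that shrinks as `n_trials` grows.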


**Scenario:** A company runs an advertising campaign. On any given day, a person who sees the ad has a **5% chance (`p = 0.05`)** of clicking it. The ad is shown to **100 people (`n = 100`)** today.

**Your Tasks:**

1.  Write Python code to calculate the probability that **exactly 7 people** click the ad.
2.  Write Python code to calculate the probability that **10 or fewer people** click the ad.
3.  Write Python code to calculate the probability that **more than 4 people** click the ad.


In [6]:
n = 100 # number of trials
p = 0.05 # probability of success
from scipy.stats import binom
from math import comb
k = 7 # number of successes
def probability_of_success(k, n, p):
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
probability_of_success_7 = probability_of_success(k, n, p)

In [7]:
probability_of_success_7

0.10602553736478867

In [9]:
def cumulative_probability(k, n, p):
    # P(X <= k)
    return sum(probability_of_success(i, n, p) for i in range(k + 1))
def right_tail_probability(k, n, p):
    # P(X >= k) = 1 - P(X <= k - 1)
    return 1 - cumulative_probability(k - 1, n, p)

cumulative_probability_10 = cumulative_probability(10, n, p)
# "More than 4" means X >= 5, so the right tail starts at k = 5
right_tail_probability_4 = right_tail_probability(5, n, p)
print(cumulative_probability_10)
print(right_tail_probability_4)

0.9885275899325113
0.5640186993142899
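
The tail calculation above is easy to get wrong by one, because "more than 4" and "at least 5" describe the same event but call for different arguments. One possible way to make the intent explicit is to give each inequality its own helper. The function names below (`prob_at_most`, `prob_at_least`, `prob_more_than`) are hypothetical and only meant as a sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_at_most(k, n, p):
    """P(X <= k)."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

def prob_at_least(k, n, p):
    """P(X >= k) = 1 - P(X <= k - 1)."""
    return 1 - prob_at_most(k - 1, n, p)

def prob_more_than(k, n, p):
    """P(X > k) = P(X >= k + 1)."""
    return prob_at_least(k + 1, n, p)

n, p = 100, 0.05
print(round(prob_more_than(4, n, p), 3))  # strictly more than 4 clicks
print(round(prob_at_least(5, n, p), 3))   # same event, phrased as "at least 5"
```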


In [10]:
# Using the binom distribution from scipy.stats for accuracy and efficiency
prob_more_than_4 = 1 - binom.cdf(4, n, p)
print(round(prob_more_than_4, 3))

0.564


In [11]:
# Import the necessary library
from scipy.stats import binom

# --- Define the parameters of our distribution ---
n = 100  # Number of trials (people shown the ad)
p = 0.05 # Probability of success (a single person clicking)

# --- Task 1: Probability of EXACTLY 7 clicks ---
# We use the Probability Mass Function (pmf) for this.
k_exact = 7
prob_exact_7 = binom.pmf(k=k_exact, n=n, p=p)
print(f"The probability of exactly {k_exact} clicks is: {prob_exact_7:.4f}")
# Expected output: The probability of exactly 7 clicks is: 0.1060

# --- Task 2: Probability of 10 OR FEWER clicks ---
# We use the Cumulative Distribution Function (cdf) for this.
k_le_10 = 10
prob_le_10 = binom.cdf(k=k_le_10, n=n, p=p)
print(f"The probability of {k_le_10} or fewer clicks is: {prob_le_10:.4f}")
# Expected output: The probability of 10 or fewer clicks is: 0.9885

# --- Task 3: Probability of MORE THAN 4 clicks ---
# We use the Survival Function (sf), which is 1 - cdf.
# sf(k) calculates P(X > k).
k_gt_4 = 4
prob_gt_4 = binom.sf(k=k_gt_4, n=n, p=p)
print(f"The probability of more than {k_gt_4} clicks is: {prob_gt_4:.4f}")
# Expected output: The probability of more than 4 clicks is: 0.5640

The probability of exactly 7 clicks is: 0.1060
The probability of 10 or fewer clicks is: 0.9885
The probability of more than 4 clicks is: 0.5640


## 3. Data Manipulation

In [12]:
# Setup: Create the dummy CSV files for our case study
import pandas as pd
import os

# Create a directory for our data
if not os.path.exists('sales_data'):
    os.makedirs('sales_data')

# --- Sales Files (with messy columns) ---
sales_na_data = """sale_id,product_id,customer_id,Total Sale
101,A54,C1,$150.50
102,B12,C2,$75.00
103,A54,C3,$140.25
"""
with open('sales_data/sales_north_america.csv', 'w') as f:
    f.write(sales_na_data)

sales_eu_data = """Sale ID,Product ID,CustomerID,Total Sale
201,C78,C4,$25.99
202,B12,C5,$80.00
"""
with open('sales_data/sales_europe.csv', 'w') as f:
    f.write(sales_eu_data)

sales_asia_data = """sale id,product id,customer id,Total_Sale
301,A54,C6,$155.00
302,D99,C7,$200.10
"""
with open('sales_data/sales_asia.csv', 'w') as f:
    f.write(sales_asia_data)

# --- Supporting Info Files ---
products_data = """product_id,product_name,category
A54,Laptop,Electronics
B12,Mouse,Electronics
C78,T-Shirt,Apparel
D99,Keyboard,Electronics
"""
with open('sales_data/products.csv', 'w') as f:
    f.write(products_data)

customers_data = """id,first_name,last_name,country
C1,John,Doe,USA
C2,Jane,Smith,Canada
C3,Peter,Jones,USA
C4,Hans,Schmidt,Germany
C5,Isabelle,Dubois,France
C6,Kenji,Tanaka,Japan
C7,Li,Wei,China
"""
with open('sales_data/customers.csv', 'w') as f:
    f.write(customers_data)

# --- Bonus Files for another Concat Example ---
promotions_q1_data = """promo_id,discount_percent
P1,10
P2,15
"""
with open('sales_data/promos_q1.csv', 'w') as f:
    f.write(promotions_q1_data)
    
promotions_q2_data = """promo_id,discount_percent
P3,20
P4,5
"""
with open('sales_data/promos_q2.csv', 'w') as f:
    f.write(promotions_q2_data)

print("Dummy CSV files created successfully in the 'sales_data' directory.")

Dummy CSV files created successfully in the 'sales_data' directory.


**Scenario:** You are a data scientist at a global retail company. The sales data for the first quarter is spread across multiple CSV files from different regional offices (North America, Europe, Asia). Additionally, there are separate files for product information and customer details. The data is messy. Your task is to clean and consolidate all this information into a single, master DataFrame for analysis.

**The Challenge:**
1.  Combine sales data from three regional files.
2.  Standardize the column names, which differ in casing, spacing, and separators.
3.  Clean and convert the `Total Sale` column to a numeric type.
4.  Merge the consolidated sales data with product and customer information.
5.  Create a final, clean DataFrame ready for analysis.


In [27]:
import pandas as pd
import os

# step 1: load sales data from multiple CSV files
sales_files = [
    'sales_data/sales_north_america.csv',
    'sales_data/sales_europe.csv',
    'sales_data/sales_asia.csv'
]
list_of_dfs = [pd.read_csv(path) for path in sales_files]

print(list_of_dfs[0].columns)
print(list_of_dfs[1].columns)
print(list_of_dfs[2].columns)

Index(['sale_id', 'product_id', 'customer_id', 'Total Sale'], dtype='object')
Index(['Sale ID', 'Product ID', 'CustomerID', 'Total Sale'], dtype='object')
Index(['sale id', 'product id', 'customer id', 'Total_Sale'], dtype='object')


In [28]:
clean_columns = ['sale_id', 'product_id', 'customer_id', 'total_sale']

# Positional rename: valid only because all three files share the same column order
for df in list_of_dfs:
    df.columns = clean_columns

sale_dfs = pd.concat(list_of_dfs, ignore_index=True)
print(sale_dfs.head())

   sale_id product_id customer_id total_sale
0      101        A54          C1    $150.50
1      102        B12          C2     $75.00
2      103        A54          C3    $140.25
3      201        C78          C4     $25.99
4      202        B12          C5     $80.00
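
Assigning a fixed list of names works here because every regional file has the same four columns in the same order. If that could not be guaranteed, one alternative would be to normalize each header by rule. The helper `normalize_column_name` below is a hypothetical sketch, not something the notebook defines, and it only covers the header variations seen in these three files.

```python
import re

def normalize_column_name(name):
    """Convert a header such as 'Sale ID' or 'CustomerID' to snake_case."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries: 'CustomerID' -> 'Customer_ID'
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name)
    # Collapse runs of whitespace, hyphens, or underscores into one underscore
    name = re.sub(r'[\s\-_]+', '_', name)
    return name.lower()

headers = ['Sale ID', 'Product ID', 'CustomerID', 'Total Sale',
           'sale id', 'customer id', 'Total_Sale']
for header in headers:
    print(f"{header!r} -> {normalize_column_name(header)!r}")
```

With pandas, such a function could be applied through `df.rename(columns=normalize_column_name)`, which matches columns by name rather than by position.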


In [29]:
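# regex=False treats "$" as a literal character rather than the end-of-string anchor
sale_dfs['total_sale'] = sale_dfs['total_sale'].str.replace("$", "", regex=False).astype(float)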
sale_dfs.head(3)

   sale_id product_id customer_id  total_sale
0      101        A54          C1      150.50
1      102        B12          C2       75.00
2      103        A54          C3      140.25
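
The conversion above assumes every value is a dollar sign followed by a plain decimal number. Real exports often contain thousands separators or placeholder text, in which case `astype(float)` would raise. A more forgiving variant, sketched here on made-up inputs (the values in `raw` are hypothetical and not from the sales files), strips both symbols and turns anything unparseable into `NaN`.

```python
import pandas as pd

# Hypothetical messy inputs: a thousands separator and a non-numeric placeholder
raw = pd.Series(['$150.50', '$1,200.00', 'n/a'])

# Inside a character class, "$" is a literal dollar sign, so [$,] removes "$" and ","
stripped = raw.str.replace(r'[$,]', '', regex=True)

# errors='coerce' converts values that cannot be parsed into NaN instead of raising
cleaned = pd.to_numeric(stripped, errors='coerce')
print(cleaned)
```

Counting the resulting `NaN` values with `cleaned.isna().sum()` gives a quick indication of how many rows failed to parse.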


In [30]:
promos_files = ['sales_data/promos_q1.csv',
    'sales_data/promos_q2.csv']
promos = [pd.read_csv(file) for file in promos_files]
promo = pd.concat(promos, ignore_index=True )
print(promos[0].head(1))
print(promos[1].head(1))

  promo_id  discount_percent
0       P1                10
  promo_id  discount_percent
0       P3                20


In [31]:
customers = pd.read_csv('sales_data/customers.csv')
products = pd.read_csv('sales_data/products.csv')
customers.head(1)


   id first_name last_name country
0  C1       John       Doe     USA


In [32]:
products.head(1)

  product_id product_name     category
0        A54       Laptop  Electronics


In [34]:
final_data = sale_dfs.merge(products, on='product_id', how='left')
final_data.head(3)

   sale_id product_id customer_id  total_sale product_name     category
0      101        A54          C1      150.50       Laptop  Electronics
1      102        B12          C2       75.00        Mouse  Electronics
2      103        A54          C3      140.25       Laptop  Electronics


In [35]:
final_data = final_data.merge(customers, left_on = 'customer_id', right_on='id', how='left')
final_data.head(3)

   sale_id product_id customer_id  total_sale product_name     category  id first_name last_name country
0      101        A54          C1      150.50       Laptop  Electronics  C1       John       Doe     USA
1      102        B12          C2       75.00        Mouse  Electronics  C2       Jane     Smith  Canada
2      103        A54          C3      140.25       Laptop  Electronics  C3      Peter     Jones     USA
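
Merging with `left_on`/`right_on` keeps both key columns, which is why `id` duplicates `customer_id` in the result above. A left merge also silently leaves `NaN` in the customer fields for any sale whose customer is missing from the lookup table. The sketch below shows one way to guard against both, using small stand-in frames. The sale `999` and customer `C9` are invented for illustration and do not exist in the case-study files.

```python
import pandas as pd

# Stand-in data: sale 999 refers to a customer (C9) that is not in the lookup table
sales = pd.DataFrame({
    'sale_id': [101, 102, 999],
    'customer_id': ['C1', 'C2', 'C9'],
})
customers = pd.DataFrame({
    'id': ['C1', 'C2'],
    'country': ['USA', 'Canada'],
})

merged = sales.merge(
    customers.rename(columns={'id': 'customer_id'}),  # align key names, avoiding a duplicate column
    on='customer_id',
    how='left',
    validate='many_to_one',  # raises if a customer id appears more than once in the lookup
    indicator=True,          # adds a '_merge' column recording where each row matched
)

# Rows that found no matching customer
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['sale_id', 'customer_id']])
```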


In [36]:
sales_by_country = final_data.groupby('country')['total_sale'].agg(['sum', 'mean', 'count'])
print(sales_by_country)

            sum     mean  count
country                        
Canada    75.00   75.000      1
China    200.10  200.100      1
France    80.00   80.000      1
Germany   25.99   25.990      1
Japan    155.00  155.000      1
USA      290.75  145.375      2


## 4. Sklearn
**Mock Case:** A telecom company wants to predict customer churn. You are given a small dataset of customer information and need to build a model to predict whether a customer will churn (`Churn` = 1) or not (`Churn` = 0).


In [37]:
# Core libraries
import pandas as pd
import numpy as np

# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score


In [38]:
data = {
    'CustomerID': range(1, 11),
    'Tenure_Months': [12, 24, 5, 48, 60, 6, 1, 35, 22, 40],
    'Subscription_Type': ['Basic', 'Premium', 'Basic', 'Premium', 'Premium', 'Basic', 'Basic', 'Premium', 'Basic', 'Premium'],
    'Monthly_Bill': [20, 70, 20, 80, 85, np.nan, 15, 75, 25, 78], # Note the missing value
    'Churn': [0, 0, 1, 0, 0, 1, 1, 0, 1, 0] # Target variable
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

Original Data:
   CustomerID  Tenure_Months Subscription_Type  Monthly_Bill  Churn
0           1             12             Basic          20.0      0
1           2             24           Premium          70.0      0
2           3              5             Basic          20.0      1
3           4             48           Premium          80.0      0
4           5             60           Premium          85.0      0
5           6              6             Basic           NaN      1
6           7              1             Basic          15.0      1
7           8             35           Premium          75.0      0
8           9             22             Basic          25.0      1
9          10             40           Premium          78.0      0


In [43]:
X = df.drop(columns=['CustomerID', 'Churn'])
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42, stratify=y)

numerical_features = ['Tenure_Months', 'Monthly_Bill']
categorical_features = ['Subscription_Type']

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])




In [None]:
X_trans = preprocessor.fit_transform(X_train)
X_trans

array([[ 0.73621206,  0.78048462,  0.        ,  1.        ],
       [-0.09602766, -1.08893962,  1.        ,  0.        ],
       [-0.73621206, -1.27588204,  1.        ,  0.        ],
       [ 1.56845178,  0.96742704,  0.        ,  1.        ],
       [-1.1203227 ,  0.59354219,  1.        ,  0.        ],
       [-1.4404149 , -1.46282446,  1.        ,  0.        ],
       [ 1.05630426,  0.89265007,  0.        ,  1.        ],
       [ 0.03200922,  0.59354219,  0.        ,  1.        ]])

In [None]:
lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])
lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)
print("Logistic Regression Model Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

Logistic Regression Model Accuracy: 1.0
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



In [47]:
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100,
                                          max_depth=5,
                                          random_state=42))
])
rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_test)
print("Random Forest Model Accuracy:", accuracy_score(y_test, y_pred_rf))
print("ROC AUC Score:", roc_auc_score(y_test, rf_pipeline.predict_proba(X_test)[:, 1]))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

Random Forest Model Accuracy: 1.0
ROC AUC Score: 1.0
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
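
Both models score 1.0, but the test set contains only two rows (one per class, as the `support` column shows), so these figures say little about generalization. With so few samples, stratified k-fold cross-validation is one common way to use every row for evaluation. The following is a sketch under assumptions: the choice of 4 folds, the shuffle seed, and the variable names `cv` and `scores` are arbitrary and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy dataset as in the mock case
df = pd.DataFrame({
    'Tenure_Months': [12, 24, 5, 48, 60, 6, 1, 35, 22, 40],
    'Subscription_Type': ['Basic', 'Premium', 'Basic', 'Premium', 'Premium',
                          'Basic', 'Basic', 'Premium', 'Basic', 'Premium'],
    'Monthly_Bill': [20, 70, 20, 80, 85, np.nan, 15, 75, 25, 78],
    'Churn': [0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
})
X = df.drop(columns=['Churn'])
y = df['Churn']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['Tenure_Months', 'Monthly_Bill']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Subscription_Type']),
])
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42)),
])

# 4 folds is the maximum here because the minority class has only 4 samples
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Because the preprocessing lives inside the pipeline, the imputer and scaler are refit on each training fold, which avoids leaking statistics from the held-out rows.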

