# Generated Exercise Notebook

**Source dataset:** `be192d6b-2a99-44a6-b829-f7922726520e.csv`  
Rows: **10000**, Columns: **8**  

**Columns:** ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

**Numeric columns:** []

**Categorical columns:** ['Transaction ID', 'Item', 'Quantity', 'Price Per Unit', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date']

---

Open each question cell, run the starter code, and finish the required steps.

In [None]:
# Load dataset
import pandas as pd
df = pd.read_csv(r"/mnt/data/be192d6b-2a99-44a6-b829-f7922726520e.csv")
df.head()

## Question 1 — Basic dataset overview

Write code to show:

- the output of `df.info()`
- the number of missing values per column
- basic descriptive statistics for numeric columns

Explain any observations in a markdown cell below the outputs.

In [None]:
df.info()  # info() prints its report directly and returns None, so it needs no print() wrapper
print('\nMissing values per column:\n', df.isnull().sum())
print('\nDescriptive statistics:\n', df.describe(include='all'))
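
The three outputs requested above can also be condensed into one per-column table, which makes the observations easier to write up. The following is a minimal sketch on a few made-up rows that reuse three of the dataset's column names; the values and the `overview` name are illustrative assumptions, not taken from the CSV.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Item": ["Coffee", "Tea", None, "Cake"],
    "Quantity": ["2", "1", "3", None],
    "Payment Method": ["Cash", None, None, "Card"],
})

# One row per column: dtype, missing count, and missing share in percent.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(1),
})
print(overview)
```

Sorting `overview` by `missing_pct` is one way to decide which columns to discuss first.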

## Question 2 — Data types and conversion

Identify columns with incorrect data types (e.g., numeric stored as object). Convert at least one such column to the appropriate dtype and verify conversion.

In [None]:
df.dtypes
# Example conversion (replace 'colname' with actual column):
# df['colname'] = pd.to_numeric(df['colname'], errors='coerce')
# df['datecol'] = pd.to_datetime(df['datecol'], errors='coerce')
# Verify:
# df.dtypes
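
Because every column in this file loads as `object`, the conversion step is worth verifying rather than assuming. The sketch below converts two numeric-looking columns and the date column, then counts how many values became missing only because of the conversion. The sample rows, including the `"ERROR"` and `"UNKNOWN"` placeholders, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows; the "ERROR" and "UNKNOWN" placeholders are assumed, not read from the CSV.
sample = pd.DataFrame({
    "Quantity": ["2", "ERROR", "5", None],
    "Price Per Unit": ["3.0", "4.5", "UNKNOWN", "1.5"],
    "Transaction Date": ["2023-01-15", "2023-02-30", "2023-03-01", "ERROR"],
})

missing_before = sample.isnull().sum()

# errors="coerce" turns unparseable values into NaN / NaT instead of raising.
for col in ["Quantity", "Price Per Unit"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")
sample["Transaction Date"] = pd.to_datetime(
    sample["Transaction Date"], format="%Y-%m-%d", errors="coerce"
)

# Values that are missing only because the conversion could not parse them.
newly_missing = sample.isnull().sum() - missing_before
print(sample.dtypes)
print(newly_missing)
```

A large `newly_missing` count suggests the column holds placeholder strings that deserve a look before they are silently turned into missing values.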

## Question 3 — Duplicates & unique values

Detect duplicate rows and drop them if appropriate. For each categorical column, show the number of unique values and the top 5 frequent categories.

In [None]:
print('Duplicate rows:', df.duplicated().sum())
# df = df.drop_duplicates()
for c in df.select_dtypes(include=['object','category']).columns:
    print('\nColumn:', c)
    print('Unique values:', df[c].nunique())
    print('Top 5 frequent:')
    print(df[c].value_counts().head())
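
Two refinements are often useful here: checking for repeated identifiers separately from fully identical rows, and keeping missing values visible in the frequency counts. This is a sketch on invented rows; the category labels are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_2", "TXN_3"],
    "Item": ["Coffee", "Tea", "Tea", "Coffee"],
    "Location": ["In-store", "Takeaway", "Takeaway", None],
})

# Rows identical in every column versus rows that only repeat the identifier.
full_dupes = int(sample.duplicated().sum())
id_dupes = int(sample.duplicated(subset=["Transaction ID"]).sum())
print("Fully duplicated rows:", full_dupes)
print("Repeated Transaction IDs:", id_dupes)

deduped = sample.drop_duplicates().reset_index(drop=True)

# dropna=False reports missing values as their own category.
location_counts = deduped["Location"].value_counts(dropna=False)
print(location_counts)
```

Identifier-like columns such as `Transaction ID` tend to have nearly as many unique values as rows, so their top-5 list carries little information.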

## Question 4 — Handling missing values

Choose a missing-value strategy for each column with missing data (drop, fill with mean/median/mode, forward/backward fill). Implement the chosen strategy and show before/after counts.

In [None]:
# Example strategies. Replace with chosen strategies
missing = df.isnull().sum()  # count once, then keep only columns with gaps
missing = missing[missing > 0]
print('Columns with missing values:\n', missing)
# Example:
# df['num_col'] = df['num_col'].fillna(df['num_col'].median())
# df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0])
# Verify:
# print(df.isnull().sum())
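
The commented strategies above can be combined into one pass with a before/after table. The sketch below assumes the numeric column has already been converted and applies three illustrative policies: drop rows without an identifier, fill a numeric column with its median, and fill a categorical column with its mode. The sample values and the choice of policy per column are assumptions.

```python
import pandas as pd

# Hypothetical rows; Quantity is assumed to be numeric already.
sample = pd.DataFrame({
    "Quantity": [1.0, 2.0, None, 5.0, 4.0],
    "Payment Method": ["Cash", None, "Cash", "Card", None],
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", None, "TXN_5"],
})

before = sample.isnull().sum()

# Assumed policy: a row without an identifier is dropped rather than filled.
cleaned = sample.dropna(subset=["Transaction ID"]).copy()
# Numeric column: the median is less sensitive to outliers than the mean.
cleaned["Quantity"] = cleaned["Quantity"].fillna(cleaned["Quantity"].median())
# Categorical column: fill with the most frequent value.
cleaned["Payment Method"] = cleaned["Payment Method"].fillna(
    cleaned["Payment Method"].mode()[0]
)

after = cleaned.isnull().sum()
print(pd.DataFrame({"before": before, "after": after}))
```

Filling with the mode inflates the most common category, so a separate label such as `"Unknown"` may be the better choice when many values are missing.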

## Question 5 — Exploratory plotting

Create at least three plots that help you understand the dataset:

- Histogram or KDE for numeric columns
- Boxplot to spot outliers
- Bar chart for a categorical column

Include a brief interpretation of each plot.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

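# Example histogram for numeric columns.
# The raw file loads every column as object, so convert dtypes (Question 2) first;
# otherwise the numeric loops below have nothing to plot.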
for c in df.select_dtypes(include=['number']).columns[:3]:
    plt.figure()
    df[c].hist()
    plt.title(f'Histogram of {c}')

# Example boxplot
for c in df.select_dtypes(include=['number']).columns[:3]:
    plt.figure()
    df.boxplot(column=c)
    plt.title(f'Boxplot of {c}')

# Bar chart for first categorical column
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
if len(cat_cols)>0:
    plt.figure()
    df[cat_cols[0]].value_counts().head(10).plot(kind='bar')
    plt.title(f'Top categories in {cat_cols[0]}')
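
The three plot types can also be drawn side by side on one figure, which makes it easier to interpret them together. The sketch below uses invented rows, converts a string column to numbers before plotting, and selects the headless `Agg` backend so it also runs outside a notebook. In a notebook, that backend line can be dropped in favour of `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, assumed here so the sketch runs as a plain script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows; "ERROR" is an assumed placeholder value.
sample = pd.DataFrame({
    "Total Spent": ["4.0", "6.0", "ERROR", "20.0", "3.0", "5.0"],
    "Item": ["Coffee", "Tea", "Coffee", "Cake", "Coffee", "Tea"],
})
total = pd.to_numeric(sample["Total Spent"], errors="coerce").dropna()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: overall shape of the distribution.
axes[0].hist(total.to_numpy(), bins=5)
axes[0].set_title("Histogram of Total Spent")

# Boxplot: points beyond the whiskers are outlier candidates.
axes[1].boxplot(total.to_numpy())
axes[1].set_title("Boxplot of Total Spent")

# Bar chart: frequency of each category.
item_counts = sample["Item"].value_counts()
axes[2].bar(item_counts.index.tolist(), item_counts.to_numpy())
axes[2].set_title("Transactions per Item")

fig.tight_layout()
```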

## Question 6 — Outliers

Detect outliers in numeric columns using the IQR method. Propose and implement one method to handle them (cap, remove, or transform). Show the effect on the distribution.

In [None]:
def iqr_filter(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    low = Q1 - 1.5*IQR
    high = Q3 + 1.5*IQR
    return low, high

for c in df.select_dtypes(include=['number']).columns[:4]:
    low, high = iqr_filter(df[c].dropna())
    print(c, 'low', low, 'high', high)
    print('Outliers count:', ((df[c] < low)|(df[c] > high)).sum())

# Example capping (run it inside the loop so `low` and `high` belong to the same column `c`):
# df[c] = df[c].clip(lower=low, upper=high)
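
To show the effect of capping, the bounds have to be computed and applied per column, and the result compared with the original. The sketch below does this on invented numeric data; the helper name `iqr_bounds` and its `k` parameter are illustrative.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical numeric data with one extreme row.
sample = pd.DataFrame({
    "Quantity": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0],
    "Total Spent": [4.0, 6.0, 6.0, 9.0, 9.0, 12.0, 150.0],
})

capped = sample.copy()
for col in ["Quantity", "Total Spent"]:
    low, high = iqr_bounds(sample[col].dropna())
    n_out = int(((sample[col] < low) | (sample[col] > high)).sum())
    # clip() replaces values outside the fences with the fence value.
    capped[col] = sample[col].clip(lower=low, upper=high)
    print(f"{col}: bounds=({low}, {high}), outliers={n_out}")

# Compare the extremes before and after capping.
print(pd.DataFrame({"max_before": sample.max(), "max_after": capped.max()}))
```

Capping keeps every row but flattens the tail. Removing rows or applying a log transform are alternatives when the extreme values are errors or the distribution is heavily skewed.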


## Question 7 — Correlation analysis

Compute pairwise correlations for the numeric features and visualize them as a heatmap. Identify any strong correlations (|r| > 0.7).

In [None]:
import matplotlib.pyplot as plt
corr = df.select_dtypes(include=['number']).corr()
print(corr)
plt.figure(figsize=(8,6))
plt.imshow(corr, interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.title('Correlation matrix (visual)')

# Identify strong correlations
strong = []
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i,j]) > 0.7:
            strong.append((corr.columns[i], corr.columns[j], corr.iloc[i,j]))
print('Strong correlations (|r| > 0.7):', strong)
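
The nested loop can also be written as a vectorized filter: mask everything except the upper triangle of the matrix, flatten it to pairs, and keep the pairs above the threshold. This is a sketch on invented numbers in which `Total Spent` is constructed as `Quantity` times `Price Per Unit`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; Total Spent is built as Quantity * Price Per Unit.
sample = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5],
    "Price Per Unit": [2.0, 3.0, 2.0, 3.0, 2.0],
})
sample["Total Spent"] = sample["Quantity"] * sample["Price Per Unit"]

corr = sample.corr()

# Keep the upper triangle only (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()  # dropna() discards the masked cells

strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Using `abs()` keeps strong negative correlations as well, which is what the |r| > 0.7 criterion asks for.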

## Question 8 — Feature engineering

Create at least two new features from existing columns (example: ratio, interaction term, datetime parts). Show code and rationale.

In [None]:
# Example: if there are numeric columns 'a' and 'b'
# df['a_to_b_ratio'] = df['a'] / (df['b'] + 1e-9)
# If a datetime column exists: df['month'] = pd.to_datetime(df['datecol']).dt.month

# Print head to show new features
# df.head()
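
For this dataset's columns, two natural candidates are calendar parts of `Transaction Date` and a consistency flag comparing `Total Spent` with `Quantity` times `Price Per Unit`. The sketch below assumes the columns are already converted to numeric and datetime dtypes. The rows and the new feature names are hypothetical.

```python
import pandas as pd

# Hypothetical rows with already-converted dtypes.
sample = pd.DataFrame({
    "Quantity": [2, 1, 3],
    "Price Per Unit": [3.0, 4.5, 2.0],
    "Total Spent": [6.0, 4.5, 7.0],
    "Transaction Date": pd.to_datetime(["2023-01-14", "2023-06-05", "2023-12-31"]),
})

# Datetime parts: month for seasonality, weekday for weekly patterns.
sample["month"] = sample["Transaction Date"].dt.month
sample["day_of_week"] = sample["Transaction Date"].dt.dayofweek  # Monday=0 ... Sunday=6
sample["is_weekend"] = sample["day_of_week"] >= 5

# Consistency flag: does the recorded total match quantity times unit price?
expected_total = sample["Quantity"] * sample["Price Per Unit"]
sample["total_mismatch"] = (sample["Total Spent"] - expected_total).abs() > 1e-9

print(sample.head())
```

The rationale for the flag is that a mismatch points to a data-entry problem, which can be informative as a feature or as a cleaning step.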

## Question 9 — Encoding categorical variables

Encode categorical variables using appropriate techniques (one-hot, label encoding, target encoding). Provide code and explanation for choices.

In [None]:
from sklearn.preprocessing import LabelEncoder
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
print('Categorical columns:', cat_cols)
# Example label encoding:
# le = LabelEncoder()
# df['col_le'] = le.fit_transform(df['col'])
# Example one-hot:
# df = pd.get_dummies(df, columns=['col1','col2'], drop_first=True)
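
One way to choose an encoding is by cardinality: one-hot encode columns with few distinct values and leave identifier-like columns out. The sketch below applies that rule with pandas only. The `max_levels` threshold and the category labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the category labels are invented.
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", "TXN_4"],
    "Payment Method": ["Cash", "Card", "Cash", "Digital Wallet"],
    "Location": ["In-store", "Takeaway", "In-store", "Takeaway"],
})

max_levels = 3  # assumed cut-off; columns with more distinct values are skipped
n_unique = sample.nunique()
low_card = [c for c in sample.columns if n_unique[c] <= max_levels]
print("One-hot encoding:", low_card)

# drop_first=True removes one level per column to avoid redundant indicators.
encoded = pd.get_dummies(sample[low_card], columns=low_card, drop_first=True, dtype=int)
print(encoded)
```

Label encoding assigns arbitrary integers, which a linear model reads as an order. That makes it a better fit for tree-based models or for encoding the target itself.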


## Question 10 — Build a simple model

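Identify a suitable target column (the generated suggestion is `Transaction Date`; a raw date is awkward to predict directly, so either derive a label from it or justify a different target). Split the data into train and test sets, train a simple model (e.g., `LogisticRegression` for classification or `LinearRegression` for regression), evaluate it with appropriate metrics, and report the results.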

In [None]:
# Example starter (replace 'target' with actual target column):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error

# Prepare X, y (this is a placeholder — replace with proper preprocessing)
# y = df['target']
# X = df.drop(columns=['target'])
# X = pd.get_dummies(X, drop_first=True)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# If classification:
# model = LogisticRegression(max_iter=1000)
# model.fit(X_train, y_train)
# preds = model.predict(X_test)
# print('Accuracy:', accuracy_score(y_test, preds))

# If regression:
# model = LinearRegression()
# model.fit(X_train, y_train)
# preds = model.predict(X_test)
# The `squared` argument is deprecated in recent scikit-learn releases, so take the root explicitly:
# print('RMSE:', mean_squared_error(y_test, preds) ** 0.5)
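
The regression branch of the starter can be sketched end to end on synthetic data. The example below assumes `Total Spent` as the target, which is an illustrative choice and not the notebook's suggestion. The rows are generated at random, and the target is constructed as quantity times price purely so that there is something learnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; nothing here is read from the CSV.
rng = np.random.default_rng(42)
n = 200
sample = pd.DataFrame({
    "Quantity": rng.integers(1, 6, size=n),
    "Price Per Unit": rng.choice([1.0, 1.5, 2.0, 3.0, 4.0, 5.0], size=n),
})
sample["Total Spent"] = sample["Quantity"] * sample["Price Per Unit"]

X = sample[["Quantity", "Price Per Unit"]]
y = sample["Total Spent"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

# Baseline: always predict the training mean, to put the RMSE in context.
baseline = np.full(len(y_test), y_train.mean())
baseline_rmse = mean_squared_error(y_test, baseline) ** 0.5

print(f"Model RMSE: {rmse:.3f}  Baseline RMSE: {baseline_rmse:.3f}")
```

Reporting a baseline alongside the model score makes it clear whether the model has learned anything beyond the average.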


## Question 11 — Model tuning & validation

Perform cross-validation and simple hyperparameter tuning (GridSearchCV or RandomizedSearchCV) on the model from Q10. Report the best parameters and CV score.

In [None]:
# Example starter:
from sklearn.model_selection import GridSearchCV
# param_grid = {'C':[0.01,0.1,1,10]}
# grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
# grid.fit(X_train, y_train)
# print(grid.best_params_, grid.best_score_)
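
A runnable version of the grid search is sketched below on synthetic classification data from `make_classification`, because the real target is left for the reader to choose. Wrapping the scaler and the model in a pipeline is an added design choice, not part of the starter. It means the scaling is refit inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and binary target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Mean CV accuracy:", grid.best_score_)
print("Held-out accuracy:", grid.score(X_test, y_test))
```

`best_score_` is the mean cross-validated score of the best setting. The held-out accuracy is a separate check on data the search never saw.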


## Question 12 — Reproducibility & report

Save the final cleaned dataset to a CSV (`cleaned_data.csv`), and write a short markdown summary (3-5 bullet points) describing the key findings and the modeling results.

In [None]:
# Example save:
# df.to_csv('cleaned_data.csv', index=False)
# Print a template for the summary:
print('Write 3-5 bullet points summarizing findings and model performance here.')
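
CSV does not store dtypes, so a quick round-trip check helps confirm that the saved file reloads as intended. The sketch below writes a small invented frame to a temporary directory, which is used here only to avoid leaving files behind, and reads it back with the date column parsed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data with converted dtypes.
cleaned = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2"],
    "Quantity": [2, 1],
    "Transaction Date": pd.to_datetime(["2023-01-14", "2023-06-05"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_data.csv"
    cleaned.to_csv(path, index=False)
    # Dates come back as plain strings unless parse_dates is given.
    reloaded = pd.read_csv(path, parse_dates=["Transaction Date"])

same_shape = reloaded.shape == cleaned.shape
dates_ok = pd.api.types.is_datetime64_any_dtype(reloaded["Transaction Date"])
print("Same shape:", same_shape, "| dates parsed:", dates_ok)
```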