#### Exercise 1
<!-- @q -->

1. What kinds of EDA techniques might you use to explore the following types of data:
    - Numeric data?  
    1 Summary statistics such as count, min, max, mean, median, and standard deviation, plus plots such as histograms and scatterplots.
    2 Check for missing values.
    3 Outlier detection and handling.
    4 Look at correlations between columns.
    5 Data transformation.
    6 Visualization.
    7 Feature engineering.

    - Categorical data?  
    1 Understand the data: list the unique categories and count how often each one occurs.
    2 Handle missing and duplicate data, and transform the data where needed.
    3 Analyze single variables, or the relationship between multiple variables.
    4 Visualize and summarize the results.

    - The relationship between categorical and numeric data?
    1 Categorical and numeric data are often connected. A category can shift the distribution of a numeric column (for example, average height differs between the “male” and “female” categories), and numeric summaries can describe each category (for example, the average age in each group).
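
The techniques listed above can be sketched on a small toy data set. This is a minimal illustration, not the expected answer: the frame `toy` and its columns `group` and `value` are hypothetical names invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=300),
    "value": rng.normal(loc=10.0, scale=2.0, size=300),
})

# Numeric column: summary statistics and a missing-value count.
numeric_summary = toy["value"].describe()
n_missing = toy["value"].isna().sum()

# Categorical column: how often each category occurs.
category_counts = toy["group"].value_counts()

# Relationship: numeric statistics computed within each category.
by_group = toy.groupby("group")["value"].agg(["count", "mean", "median", "std"])

print(numeric_summary)
print(category_counts)
print(by_group)
```

Comparing the per-group rows of `by_group` is one simple way to see whether the category appears to shift the numeric column.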

*Enter your answer in this cell*

2. Generate some fake data (~1000 rows) with 1 categorical column (with 10 categories) and 2 numeric columns. Use the techniques you mentioned to explore the numeric, categorical, and the relationship between them.

In [25]:
import numpy as np
import pandas as pd

# set seed for repeatable results
np.random.seed(42)

# number of rows
n = 1000

# student ids
student_id = np.arange(1, n + 1)

# academic year categories
years = np.random.choice(
    ['Freshman', 'Sophomore', 'Junior', 'Senior', 'Graduate'],
    size=n,
    p=[0.2, 0.2, 0.2, 0.25, 0.15]  # rough distribution
)

# gender categories
genders = np.random.choice(['Male', 'Female', 'Other'], size=n, p=[0.48, 0.48, 0.04])

# majors
majors = np.random.choice(
    ['Computer Science', 'Information Systems', 'Business', 'Psychology', 'Biology', 'Engineering'],
    size=n
)

# helper to sample ages by academic year
def sample_age(year_array):
    ages = []
    for y in year_array:
        if y == 'Freshman':
            age = np.random.normal(18.5, 0.6)  # 18–19
        elif y == 'Sophomore':
            age = np.random.normal(19.5, 0.6)  # 19–20
        elif y == 'Junior':
            age = np.random.normal(20.5, 0.7)  # 20–21
        elif y == 'Senior':
            age = np.random.normal(21.8, 0.8)  # 21–23
        else:  # Graduate
            age = np.random.normal(25, 2.0)   # 22–30
        ages.append(age)
    return np.array(ages)

# generate ages and GPA
ages = sample_age(years)
ages = np.clip(ages, 17, 30)  # keep realistic

gpa = np.clip(np.random.normal(3.2, 0.4, size=n), 0.0, 4.0)  # between 0 and 4

# add some missingness
ages[np.random.rand(n) < 0.03] = np.nan
gpa[np.random.rand(n) < 0.02] = np.nan

# build dataframe
df = pd.DataFrame({
    'student_id': student_id,
    'year': years,
    'gender': genders,
    'major': majors,
    'age': np.round(ages, 1),
    'gpa': np.round(gpa, 2)
})

# quick peek at data
df.head()


|   | student_id | year | gender | major | age | gpa |
|---|---|---|---|---|---|---|
| 0 | 1 | Sophomore | Male | Psychology | 19.3 | 3.38 |
| 1 | 2 | Graduate | Female | Business | 25.7 | 2.74 |
| 2 | 3 | Senior | Female | Psychology | 22.6 | 2.91 |
| 3 | 4 | Junior | Female | Engineering | 21.6 | 2.79 |
| 4 | 5 | Freshman | Female | Information Systems | 18.1 | 2.3 |


In [26]:
# Your code here


In [27]:
# Your code here


In [28]:
# Your code here


In [29]:
# Your code here


#### Exercise 2


Generate a data set you can use with a supervised ML model.  The data should meet the following criteria:
   - It should have 1000 rows
   - It should have 6 columns: one boolean "target" column, one categorical column with 5 categories, and 4 numeric columns.
   - The numeric columns should have dramatically different scales: different means and different standard deviations.
   - Each non-target column should have about 5% nulls.

Make this data a little more interesting by calculating the target column using a noisy function of the other columns.
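
One way to meet these criteria is sketched below. It is only an illustration under stated assumptions: the column names (`income`, `age`, `rate`, `visits`, `region`), the coefficients, and the noise level are all invented for the example and are not part of the assignment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Hypothetical column names; the four numeric features sit on very different scales.
data = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, size=n),
    "age": rng.normal(40, 12, size=n),
    "rate": rng.normal(0.05, 0.01, size=n),
    "visits": rng.normal(300, 80, size=n),
    "region": rng.choice(["north", "south", "east", "west", "central"], size=n),
})

# Noisy target: a linear score on standardized features plus Gaussian noise,
# thresholded at zero. It is computed before nulls are injected, so the
# target itself is never missing.
numeric_cols = ["income", "age", "rate", "visits"]
z = (data[numeric_cols] - data[numeric_cols].mean()) / data[numeric_cols].std()
score = (
    1.0 * z["income"]
    - 0.5 * z["age"]
    + 0.8 * z["rate"]
    + 0.7 * (data["region"] == "north")
    + rng.normal(0, 1.0, size=n)
)
data["target"] = score > 0

# Blank out roughly 5% of each non-target column independently.
feature_cols = [c for c in data.columns if c != "target"]
for col in feature_cols:
    data.loc[rng.random(n) < 0.05, col] = np.nan

print(data[numeric_cols].agg(["mean", "std"]))
print(data.isna().mean())
```

Because the target depends on the features only through a noisy score, no single column determines it exactly, which keeps the classification task from being trivial.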

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

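# pick features and make a simple target
# note: 'gpa' is both a feature and the source of the target, so the model
# effectively sees the answer (target leakage); rows with missing gpa
# compare as False and are labeled 0
X = df[['year', 'gender', 'major', 'age', 'gpa']].copy()   # features i want
y = (df['gpa'] >= 3.0).astype(int)                         # 1 if gpa is 3 or more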

# fill missing numeric with median
X[['age', 'gpa']] = X[['age', 'gpa']].fillna(X[['age', 'gpa']].median())

# fill missing categorical with most frequent
for col in ['year', 'gender', 'major']:
    X[col] = X[col].fillna(X[col].mode().iloc[0])

# turn categories into numbers
X_enc = pd.get_dummies(X, drop_first=True)                 

# make the model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# 5 fold f1 score
scores = cross_val_score(log_reg, X_enc, y, scoring='f1', cv=5)

print("f1 mean:", scores.mean().round(3))
print("each fold:", np.round(scores, 3))


f1 mean: 0.965
each fold: [0.971 0.964 0.978 0.96  0.953]


In [31]:
# Your code here


#### Exercise 3

Use whatever resources you need to figure out how to build an SKLearn ML pipeline. Use a pipeline to build an ML approach to predicting your target column in the preceding data with logistic regression.  I have set up the problem below so that you will write your code in a function that takes an SKLearn model and a data frame and returns the results of a cross-validation scoring routine.  

I have not taught you how to do this; use the book, Google, the notes, ChatGPT, or whatever. This is a test of your ability to *find* information and use it to construct a solution. Your solution should:

- Use a transformer pipeline that processes your numeric and categorical features separately
- Place everything in a pipeline with the classifier that is passed in to the function.
- I've already implemented the call to cross_val_score - to make it work, you'll need to assign your pipeline to the `pipeline` variable.
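
The requirements above could be sketched roughly as follows. This is a hedged illustration, not the instructor's template: the signature of `run_classifier`, the `target_col` parameter, the `'target'` column name, and the toy frame `demo` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def run_classifier(model, frame, target_col="target"):
    """Cross-validate `model` inside a preprocessing pipeline (assumed signature)."""
    X = frame.drop(columns=[target_col])
    y = frame[target_col].astype(int)

    # Split the feature columns by type so each group gets its own treatment.
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    # Numeric features: fill nulls with the median, then standardize.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categorical features: fill nulls with the mode, then one-hot encode.
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ])

    # Preprocessing and classifier are fit together inside each CV fold.
    pipeline = Pipeline([("preprocess", preprocess), ("classifier", model)])
    return cross_val_score(pipeline, X, y, scoring="f1", cv=5)


# Hypothetical toy data, used only to exercise the function.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "x1": rng.normal(0, 1, n),
    "x2": rng.normal(100, 25, n),
    "cat": rng.choice(["a", "b", "c"], size=n),
})
demo["target"] = (demo["x1"] + rng.normal(0, 0.5, n)) > 0
demo.loc[rng.random(n) < 0.05, "x1"] = np.nan
demo.loc[rng.random(n) < 0.05, "cat"] = np.nan

scores = run_classifier(LogisticRegression(max_iter=1000), demo)
print("f1 mean:", round(scores.mean(), 3))
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are refit on the training portion of each fold, so no statistics from the held-out fold leak into preprocessing.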

_Note: You could just feed this question to AI and get an answer, and chances are, it will be right. But if you do, you won't really learn much. So, be thoughtful in your use of AI here - you can use it to build the solution step by step, and it will explain how everything works. It's all in how you use it. So, it's your choice - go for the easy grade, or learn something._

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# assume df already exists from the exercise 1 cell with columns:
# year, gender, major, age, gpa
# if not, raise a friendly error
required_cols = {'year', 'gender', 'major', 'age', 'gpa'}
if not required_cols.issubset(set(df.columns)):
    raise ValueError("df must have columns: year, gender, major, age, gpa")

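# pick features and make a simple target
# note: 'gpa' is both a feature and the source of the target, so the model
# effectively sees the answer (target leakage); rows with missing gpa
# compare as False and are labeled 0
X = df[['year', 'gender', 'major', 'age', 'gpa']].copy()   # features i want
y = (df['gpa'] >= 3.0).astype(int)                         # target 1 if gpa at least 3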

# fill missing numeric with median
X[['age', 'gpa']] = X[['age', 'gpa']].fillna(X[['age', 'gpa']].median())

# fill missing categorical with most frequent
for col in ['year', 'gender', 'major']:
    X[col] = X[col].fillna(X[col].mode().iloc[0])

# one hot encode all categories
X_enc = pd.get_dummies(X, drop_first=True)                 

# build random forest
rf = RandomForestClassifier(
    n_estimators=200,    # enough trees for stable scores
    random_state=42,     # repeatable
    n_jobs=-1            # use all cores
)

# run 5 fold cv with f1
scores = cross_val_score(rf, X_enc, y, scoring='f1', cv=5)

print("rf f1 mean:", scores.mean().round(3))
print("rf each fold:", np.round(scores, 3))


KeyError: 'target'

Try using a `RandomForestClassifier` in the preceding pipeline. Just call `run_classifier` with a `RandomForestClassifier`, and print out the results as above.

In [None]:
# Your code here


Normally, `RandomForestClassifier`s are considered to be more powerful than `LogisticRegression`.  Depending on your data, this may or may not be the case. Reflect on your answers - which one does better here, and why do you think that is?  Once again, you might use AI, but you should probably also try to _understand_ the answer.

*Enter your answer in this cell*