## Python For Machine Learning Fall 2025
---
# Example solution for Competition 1

This example uses pandas to read the CSV files. If you are not familiar with pandas, you can use the built-in csv module instead.

The first step is to read the training file and the test file.

In [None]:
# Import libraries as needed
import pandas as pd
import numpy as np

# Import our training and testing data from the CSVs
raw_training_data = pd.read_csv('./train.csv')
raw_testing_data = pd.read_csv('./test.csv')

The next step is to preprocess the data, which includes transforming categorical variables into separate features.

In [13]:
# Preprocess the data into a feature matrix
# The get_dummies function takes the headers that contain strings and breaks them into separate bool values
# drop_first = true makes it so that we dont get sex_male and sex_female since they are redundant
training_data_encoded = pd.get_dummies(raw_training_data, columns=['sex', 'smoker', 'region'], drop_first=True)
testing_data_encoded = pd.get_dummies(raw_testing_data, columns=['sex', 'smoker', 'region'], drop_first=True)

print(training_data_encoded)

      age     bmi  children      charges  sex_male  smoker_yes  \
0      19  27.900         0  16884.92400     False        True   
1      18  33.770         1   1725.55230      True       False   
2      28  33.000         3   4449.46200      True       False   
3      33  22.705         0  21984.47061      True       False   
4      32  28.880         0   3866.85520      True       False   
...   ...     ...       ...          ...       ...         ...   
1333   50  30.970         3  10600.54830      True       False   
1334   18  31.920         0   2205.98080     False       False   
1335   18  36.850         0   1629.83350     False       False   
1336   21  25.800         0   2007.94500     False       False   
1337   61  29.070         0  29141.36030     False        True   

      region_northwest  region_southeast  region_southwest  
0                False             False              True  
1                False              True             False  
2                False  

As a basic data preparation step, split the training data into a feature matrix and a target vector, and remove the ID column from the test data.

In [15]:
# Remove the charges column from our encoded training data
training_data = training_data_encoded.drop('charges', axis=1)

# Remove the id column from our encoded test data
test_data = testing_data_encoded.drop('ID', axis=1)

# Pull the target information from the encoded training data
training_targets = training_data_encoded['charges']

Perform standardization on the numeric features to ensure they adhere to a normal distribution.

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

numerical_cols = ['age', 'bmi', 'children']
training_data[numerical_cols] = scaler.fit_transform(training_data[numerical_cols])
test_data[numerical_cols] = scaler.fit_transform(test_data[numerical_cols])
display(training_data.head(1))
display(training_targets.head(1))
display(test_data.head(1))

Unnamed: 0,age,bmi,children,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,-1.438764,-0.45332,-0.908614,False,True,False,False,True


0    16884.924
Name: charges, dtype: float64

Unnamed: 0,age,bmi,children,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,0.458813,-0.435748,-1.535295,False,False,False,False,True


Define and train a linear model.


In [None]:
from sklearn.linear_model import LinearRegression

# Create and fit a linear regression
model = LinearRegression()
model.fit(X_train, y_train)

0,1,2
,fit_intercept,True
,copy_X,True
,tol,1e-06
,n_jobs,
,positive,False


Generate the predictions using the trained model.

In [None]:
# Generate predictions using the model
y_pred = model.predict(X_test)

Write the prediction results into a DataFrame using the required format, and then save that DataFrame as a CSV file.

In [None]:
# Write the predictions into a CSV
submission = pd.DataFrame({'ID': pd.Series(testing_data_encoded['ID']), 'charges':pd.Series(y_pred)})
submission.to_csv('submission.csv', index=False)
print("Submission file successfully created!")

Submission file successfully created!
