# Week 8: working with Scikit-Learn

In this notebook, you will build simple linear regression models that predict temperature from other weather features.

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

Download data from Kaggle:

In [None]:
import kagglehub

data_path = kagglehub.dataset_download("budincsevity/szeged-weather")
print("Path to dataset file:", data_path)

data_files = os.listdir(data_path)
print('Downloaded files:', data_files)

## Loading and exploring data

Using Pandas, explore the dataset:
1. How many samples and features does it have?
2. What data types do the columns have?
3. Does it have missing values? How many? Check for both NaNs and empty strings.
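
As a minimal sketch of these checks, the cell below runs them on a tiny made-up DataFrame (`toy`, with invented values), not on the Szeged data. The same calls should carry over to the loaded dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; names and values are invented for illustration.
toy = pd.DataFrame({
    "Summary": ["Clear", "", "Foggy", "Clear"],
    "Humidity": [0.89, 0.86, np.nan, 0.83],
    "Pressure (millibars)": [1015.1, 1014.2, 1016.5, 1017.0],
})

n_samples, n_features = toy.shape      # number of rows and columns
print("Samples:", n_samples, "Features:", n_features)
print(toy.dtypes)                      # data type of each column

nan_counts = toy.isna().sum()          # NaN count per column
empty_counts = (toy == "").sum()       # empty-string count per column
print(nan_counts)
print(empty_counts)
```

Note that `isna()` does not flag empty strings, which is why the two counts are computed separately.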

Look into the pressure column values by plotting a histogram. Does it look strange? Get all unique values of this column and sort them from smallest to largest. Is there a value which should not be present in such data?
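
One possible way to do this kind of inspection is sketched below on an invented series (`reading`), in which a placeholder value of `-999.0` plays the role of the suspicious entry. The values are hypothetical and say nothing about what the real pressure column contains.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Invented measurements; -999.0 is a made-up placeholder for "something is off".
reading = pd.Series([10.2, 10.4, -999.0, 10.1, 10.4, -999.0, 10.3], name="reading")

# A histogram makes isolated, far-away values easy to spot.
fig, ax = plt.subplots()
ax.hist(reading, bins=20)
ax.set_xlabel(reading.name)
ax.set_ylabel("count")

# Sorted unique values show the extremes at either end of the array.
sorted_unique = np.sort(reading.unique())
print(sorted_unique)
```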

Remove all rows that contain NaNs, empty strings, or implausible pressure values. Keep only the numerical columns. Save the result into a new DataFrame `df`. What percentage of the data was dropped?
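
The cleaning steps could look roughly like the sketch below. It works on invented data (`raw`), and the threshold `MIN_PLAUSIBLE` is a hypothetical placeholder: which pressure values count as implausible is something to decide from the real data.

```python
import numpy as np
import pandas as pd

# Invented example data; in the exercise you would use the loaded weather DataFrame.
raw = pd.DataFrame({
    "Summary": ["Clear", "", "Foggy", "Clear", "Rain"],
    "Humidity": [0.89, 0.86, np.nan, 0.83, 0.80],
    "Pressure (millibars)": [1015.1, 1014.2, 1016.5, 12.0, 1017.0],
})

MIN_PLAUSIBLE = 900.0  # hypothetical cutoff, chosen only for this toy data

has_nan = raw.isna().any(axis=1)            # rows with at least one NaN
has_empty = (raw == "").any(axis=1)         # rows with at least one empty string
bad_pressure = raw["Pressure (millibars)"] < MIN_PLAUSIBLE

# Keep rows that pass all three checks, then keep only numerical columns.
cleaned = raw[~(has_nan | has_empty | bad_pressure)]
df_toy = cleaned.select_dtypes(include="number")

dropped_pct = 100 * (1 - len(df_toy) / len(raw))
print(f"Dropped {dropped_pct:.1f}% of rows")
```

Building one boolean mask per condition keeps each rule easy to inspect on its own before the rows are actually removed.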

## Feature selection

Our features have different units and magnitudes:

In [None]:
df.describe()

We can fully remove the "Loud cover" feature, as it only contains zeros.

Another thing we need to do is standardize the data.

1. Apply [StandardScaler](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html) to the data.

2. The result will be a NumPy array. Convert it back into a DataFrame with the correct column names.
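
A minimal sketch of both steps, run on a small invented DataFrame (`numeric`) rather than on `df`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented numeric data standing in for the cleaned DataFrame.
numeric = pd.DataFrame({
    "Temperature (C)": [9.5, 9.4, 8.3, 12.1],
    "Humidity": [0.89, 0.86, 0.83, 0.70],
})

scaler = StandardScaler()
scaled_values = scaler.fit_transform(numeric)   # NumPy array; column names are lost

# Restore the column names (and the original index) in a new DataFrame.
scaled = pd.DataFrame(scaled_values, columns=numeric.columns, index=numeric.index)
print(scaled.describe())
```

After scaling, each column should have a mean of about 0 and a population standard deviation of about 1; `describe()` reports the sample standard deviation, so its `std` row will be slightly above 1 for small samples.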

In [None]:
from sklearn.preprocessing import StandardScaler



We are interested in whether we can predict apparent temperature from other features. We will try to predict it:
- From one feature;
- From three features;
- From two features obtained with PCA.

From the feature names, it is probably obvious that "Temperature (C)" is the best predictor in this dataset. :)

1. Save the values of apparent temperature from the scaled data into a new variable `y`, which should be a NumPy array.
2. Choose any feature to use as a predictor. Save its values into a variable `X`, which should be a NumPy array. Make sure it has 2 dimensions, with the second one having size 1.
3. Choose three features instead of one, and save the result as `X3`. The second dimension should have size 3.
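
The shape requirements are the part that most often goes wrong, so here is a sketch on random data. The DataFrame `scaled_toy` and its column names are placeholders chosen for illustration, and the variables are suffixed with `_toy` so they do not overwrite your own `y`, `X`, and `X3`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random stand-in for the scaled data; column names are placeholders.
scaled_toy = pd.DataFrame(
    rng.normal(size=(6, 4)),
    columns=["Apparent Temperature (C)", "Temperature (C)", "Humidity", "Wind Speed (km/h)"],
)

# Target: a single column selected with single brackets gives a 1-D array.
y_toy = scaled_toy["Apparent Temperature (C)"].to_numpy()

# One predictor: double brackets keep the result 2-D, with shape (n_samples, 1).
X_toy = scaled_toy[["Humidity"]].to_numpy()

# Equivalent alternative: select a 1-D column and reshape it explicitly.
X_toy_alt = scaled_toy["Humidity"].to_numpy().reshape(-1, 1)

# Three predictors: shape (n_samples, 3).
X3_toy = scaled_toy[["Temperature (C)", "Humidity", "Wind Speed (km/h)"]].to_numpy()

print(y_toy.shape, X_toy.shape, X3_toy.shape)
```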

In [None]:
from sklearn.preprocessing import StandardScaler



### Running PCA

Let's create two new features out of all features we have by applying PCA.

1. Create a new NumPy array `X_init` containing only the numerical columns **without apparent temperature**.
2. Create a PCA object (`PCA()`) that outputs 2 components.
3. Fit it on `X_init` and apply the transformation. Save the result into `X_PCA`.
4. Draw a scatterplot of the transformed values. Use a small point size and make the points semi-transparent.
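
The steps above can be sketched as follows on synthetic data (`X_demo`, 200 random samples with 5 features, two of which are deliberately correlated). The array and its contents are assumptions made for the example only.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Make the second column strongly correlated with the first one.
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=2)
X_demo_pca = pca.fit_transform(X_demo)   # fit and transform in one call

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)

# Small, half-transparent points keep dense regions readable.
fig, ax = plt.subplots()
ax.scatter(X_demo_pca[:, 0], X_demo_pca[:, 1], s=5, alpha=0.5)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

Here `s` controls the marker area and `alpha` the opacity of the points.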

In [None]:
from sklearn.decomposition import PCA



Use `seaborn.pairplot` to see the relationships between all columns in `X_init`. Are the two principal components similar to any pair of the initial features? (This plot may take a while to render.)

In [None]:
import seaborn as sns
# pairplot expects a DataFrame, so wrap the NumPy array first
sns.pairplot(pd.DataFrame(X_init))

## Fitting models

Using the three sets of predictor features (`X`, `X3`, and `X_PCA`), fit a [linear regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) for apparent temperature on each of them. Print each model's score (it will be R^2). Which one performed best? Which feature is the most informative? What could we have used to detect that before choosing a predictor?
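
As a hedged illustration of the fit-and-score loop, the sketch below uses synthetic data in which one feature (`strong`) drives the target and another (`weak`) barely matters. All names and coefficients are invented. It also shows one possible pre-check, the Pearson correlation between each feature and the target; other checks exist.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
strong = rng.normal(size=n)
weak = rng.normal(size=n)
# Synthetic target: mostly determined by `strong`, plus a little noise.
target = 2.0 * strong + 0.1 * weak + 0.3 * rng.normal(size=n)

feature_sets = {
    "strong only": strong.reshape(-1, 1),
    "weak only": weak.reshape(-1, 1),
    "both": np.column_stack([strong, weak]),
}

scores = {}
for name, features in feature_sets.items():
    model = LinearRegression().fit(features, target)
    scores[name] = model.score(features, target)   # R^2 on the training data
    print(f"{name}: R^2 = {scores[name]:.3f}")

# Pearson correlation of each feature with the target, computed before any fitting.
corr_strong = np.corrcoef(strong, target)[0, 1]
corr_weak = np.corrcoef(weak, target)[0, 1]
print(f"corr(strong, target) = {corr_strong:.3f}, corr(weak, target) = {corr_weak:.3f}")
```

For a single predictor with an intercept, R^2 equals the squared Pearson correlation, which is why the correlation can hint at the best single feature in advance. The scores here are computed on the same data the models were fitted on, so they do not measure generalization.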

In [None]:
from sklearn.linear_model import LinearRegression
