## HW scoring clarifications

Once again, here is how we score your home assignment.

- You have to fill in the Google form for each question (e.g. 1, 2, 3, etc.) and each 'subtask' (e.g. 1.1, 1.3; 2.3, 2.2, etc.) **you were asked for** in the Google sheet.

- If a question is not in the Google sheet (e.g. questions 13, 17), it **does not mean that it is optional**. Such questions will be checked manually.

- You have to submit the filled-in `.ipynb` file (the very last question in the Google form).

- The `.ipynb` file must be **linearly executable** (`Kernel -> Restart & Run All -> No ERROR cells`). It is your task to make it so. **If this condition is not satisfied, we will be forced to lower the grade.**

- We do not grade the scores you obtain in part 7 with your models (you are free to use different features). However, not normalizing *numerical* features for linear regression, using the `price` as a feature (or features derived from price), and not dealing with categorical features will be considered mistakes.

- You do not need to defend an assignment.

## Each student has a personal set of questions

Google sheet with personal questions: https://docs.google.com/spreadsheets/d/13WGQ40WgAuwKny_oEPiBRgOmMhsxepOHQIH-dVAgaXQ/edit?usp=sharing

Every column corresponds to a single question, every row to a single student.

For example, Yusuf Abba needs to report questions 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1, 3.5 etc.

## Submitting results

Google form to submit your answers: https://forms.gle/6u5qpqDtfmEKJoocA

The Google form has fields for all questions, but you only need to answer **your** questions (from the Google sheet above).

Use your **skoltech email**. Fill in your first and last names with **exactly the same spelling** as in the Canvas system.

---

Every question states the type of the expected answer, e.g.

> Observe top 10 observations (int)

here your answer must be a single **integer** number.

---

If your answer is a ``float number``, then it must be provided with **3 digits after the decimal point**, e.g. 1.234

---

If your answer is a ``list of float or integer numbers or str``, then the items should be reported in descending (alphabetical) order, without spaces, separated by commas, e.g.:

10.453,9.112,5.001,5.000 - **Right**

10.453, 9.112, 5.001, 5.000 - **Wrong**
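
The formatting rules above can be sketched in a few lines of Python. This is only an illustration for numeric answers: `format_answer` is a hypothetical helper name, not part of the assignment, and it assumes floats are reported with 3 decimals while integers are printed as they are.

```python
def format_answer(values, decimals=3):
    """Join numbers in descending order, comma-separated, with no spaces."""
    ordered = sorted(values, reverse=True)
    parts = [
        # Floats get a fixed number of decimals, integers are printed as-is
        f"{v:.{decimals}f}" if isinstance(v, float) else str(v)
        for v in ordered
    ]
    return ",".join(parts)


print(format_answer([5.0, 10.453, 9.112, 5.001]))
```

The printed string can be pasted into the form directly, which avoids accidental spaces after the commas.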

---

Some of the tasks do not have corresponding fields in the Google form. They are **not optional** and they will be graded manually from your .ipynb file.

---

If you have any questions regarding this Home Assignment, ask them via telegram chat, topic 'HW1'.

# Assignment 1. Traffic volume prediction.
by Anvar Kurmukov

---

By the end of this task you will be able to manipulate huge tabular data:
1. Compute various column statistics (min, max, mean, quantiles etc.);
2. Select observations/features by condition/index;
3. Create new non-linear combinations of the columns (feature engineering);
4. Perform automated data cleaning;

and more.

---

For those who are not familiar with `pandas` we recommend these (alternative) tutorials:

1. A single notebook that covers basic pandas functionality (starting with renaming columns and ending with using map, apply etc.), about 30 short examples with links to videos: https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb . Highly recommended for everyone (about 1-3 hours to go through).

2. https://github.com/guipsamora/pandas_exercises/ covers all essential functionality in 11 topics, with exercises (and solutions).

This task will be an easy ride after these tutorials.

---

We are using a public dataset compiling weather information and traffic data continuously monitored in the Twin Cities, Minnesota from 2012 to 2018. The dataset page can be found [here](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume). We've slightly modified it so please download the dataset provided on Canvas.  

You need to download `Metro_Interstate_Traffic_Volume_mod.csv` from Canvas and place it in the same directory as this notebook.


In [1]:
import numpy as np
import pandas as pd

# 1. Loading data

As always in Data Science, you start by making a nice cup of tea (or coffee). Your next move is to load the data:

- Start with loading `Metro_Interstate_Traffic_Volume_mod.csv` file using `pd.read_csv()` function.
- You may also want to increase the maximum number of columns pandas displays: set `pd.options.display.max_columns` to 30
- Print top 10 observations in the table. `.head()`
- Print last 10 observations in the table. `.tail()`
- Print all the data columns names using method `.columns`
- Print data size (number of rows and columns). This is the `.shape` of the data.

*Almost* every python has a `head` and a `tail` just as DataFrames do.
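
As a quick warm-up, the steps listed above can be tried on a tiny in-memory table before touching the real file. This is a sketch only: `csv_text` and `toy` are made-up names and values that stand in for the Canvas dataset.

```python
import io

import pandas as pd

# Show up to 30 columns when a DataFrame is displayed
pd.options.display.max_columns = 30

# A tiny in-memory CSV stands in for the real file (values are made up)
csv_text = (
    "temp,clouds_all,weather_main\n"
    "288.28,40,Clouds\n"
    "289.36,75,Clouds\n"
    "291.72,1,Clear\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))           # first rows
print(toy.tail(2))           # last rows
print(toy.columns.tolist())  # column names
print(toy.shape)             # (number of rows, number of columns)
```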

If you are using Google Colab, you can upload the file in the cell below. If you are NOT using Colab, set COLAB_P in the cell below to False.

In [2]:
COLAB_P = False
if COLAB_P:
    print("Upload your file, then read it with pd.read_csv()")
    from google.colab import files
    uploaded = files.upload()
    fn = list(uploaded.keys())[0]
    print("File is uploaded to ", fn)
else:
    print("Place your file to the same directory as the notebook, then read your file with pd.read_csv()")

Place your file to the same directory as the notebook, then read your file with pd.read_csv()


In [3]:
# Load the data
Floppa_Russkiy_kot = pd.read_csv("Metro_Interstate_Traffic_Volume_mod.csv")

In [4]:
# Observe top 10 observations (int)
Floppa_Russkiy_kot.head(10)

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,none,288.28,0.0,0.0,40.0,Clouds,scattered clouds,2012-10-02 09:00:00,5545.0
1,none,289.36,0.0,0.0,75.0,Clouds,broken clouds,2012-10-02 10:00:00,4516.0
2,none,289.58,0.0,0.0,90.0,Clouds,overcast clouds,2012-10-02 11:00:00,4767.0
3,none,290.13,0.0,0.0,90.0,Clouds,overcast clouds,2012-10-02 12:00:00,5026.0
4,none,291.14,0.0,0.0,75.0,Clouds,broken clouds,2012-10-02 13:00:00,4918.0
5,none,291.72,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 14:00:00,5181.0
6,none,293.17,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 15:00:00,5584.0
7,none,293.86,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 16:00:00,6015.0
8,none,294.14,0.0,0.0,20.0,Clouds,few clouds,2012-10-02 17:00:00,5791.0
9,none,293.1,0.0,0.0,20.0,Clouds,few clouds,2012-10-02 18:00:00,4770.0


In [5]:
# Observe last 10 observations (int)
Floppa_Russkiy_kot.tail(10)

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
48194,none,283.84,0.0,0.0,75.0,Rain,proximity shower rain,2018-09-30 15:00:00,4302.0
48195,none,283.84,0.0,0.0,75.0,Drizzle,light intensity drizzle,2018-09-30 15:00:00,4302.0
48196,none,284.38,0.0,0.0,75.0,Rain,light rain,2018-09-30 16:00:00,4283.0
48197,none,284.79,0.0,0.0,75.0,Clouds,broken clouds,2018-09-30 17:00:00,4132.0
48198,none,284.2,0.25,0.0,75.0,Rain,light rain,2018-09-30 18:00:00,3947.0
48199,none,283.45,0.0,0.0,75.0,Clouds,broken clouds,2018-09-30 19:00:00,3543.0
48200,none,282.76,0.0,0.0,90.0,Clouds,overcast clouds,2018-09-30 20:00:00,2781.0
48201,none,282.73,0.0,0.0,90.0,Thunderstorm,proximity thunderstorm,2018-09-30 21:00:00,2159.0
48202,none,282.09,0.0,0.0,90.0,Clouds,overcast clouds,2018-09-30 22:00:00,1450.0
48203,none,282.12,0.0,0.0,90.0,Clouds,overcast clouds,2018-09-30 23:00:00,954.0


In [6]:
# Print all the columns/features names (int)
column_names = Floppa_Russkiy_kot.columns.tolist()
print("Column Names:")
print(column_names)


Column Names:
['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'weather_description', 'date_time', 'traffic_volume']


In [7]:
# Print all the columns/features names (int)
print(Floppa_Russkiy_kot.columns)
# Q1.1 How many columns end with a vowel? (here 'y' is counted as a vowel)
vowels = 'aeiouyAEIOUY'
count = sum(1 for col in Floppa_Russkiy_kot.columns if col[-1] in vowels)
print(f" Q1.1# Number of columns with vowel ending: {count}")
# Q1.2 How many columns start with a vowel?
count = sum(1 for col in Floppa_Russkiy_kot.columns if col[0] in vowels)
print(f" Q1.2# Number of columns with vowel starting: {count}")
# Q1.3 Which columns are associated with the condition of weather? 
print(" Q1.3# Name of columns with weather condition:",'rain_1h', 'snow_1h', 'clouds_all', 'weather_main', 'weather_description')
# Q1.4 How many columns have `th` in their names? 
count = sum(1 for col in Floppa_Russkiy_kot.columns if 'th' in col.lower())
print(f" Q1.4# #Number of columns with 'th': {count}")



Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
       'weather_description', 'date_time', 'traffic_volume'],
      dtype='object')
 Q1.1# Number of columns with vowel ending: 3
 Q1.2# Number of columns with vowel starting: 0
 Q1.3# Name of columns with weather condition: rain_1h snow_1h clouds_all weather_main weather_description
 Q1.4# #Number of columns with 'th': 2


In [8]:
# Print data size (int)

# Q2.1 How many observations are in the data?
num_observations = Floppa_Russkiy_kot.shape[0]
print(f"Q2.1 #Number of observations: {num_observations}")
# Q2.2 How many features are in the data?
num_features = Floppa_Russkiy_kot.shape[1]
print(f"Q2.2 #Number of features: {num_features}")


Q2.1 #Number of observations: 48204
Q2.2 #Number of features: 9


# 2. Basic data exploration

Let's do some basics:

`.count()` gives the number of non-NaN values in every column.
    
Are there any missing values in the data?     
Count the number of unique values in every column with `.nunique()`.    
What does this tell you about the features: which are most likely categorical and which are most likely numerical?    
Use pandas `.describe()` to display basic statistics about the data.   
Use pandas `.value_counts()` to count the occurrences of each unique value in a specific column.   
Use pandas `.min()`, `.max()`, `.mean()`, `.std()` to display specific statistics about the data.    
Use the pandas `.dtypes` attribute to display the data types of the columns. 
Hint: you could use `.sort_index()` or `.sort_values()` to sort the result of `.value_counts()`.
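
The methods listed above behave as follows on a small made-up table. This is a sketch: the `toy` frame and its values are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "weather_main": ["Clouds", "Clear", "Clouds", "Rain", None],
    "temp": [288.3, 289.4, np.nan, 291.7, 293.2],
})

print(toy.count())         # non-missing values per column
print(toy.isna().sum())    # missing values per column
print(toy.nunique())       # distinct non-missing values per column

# Occurrences of each value, sorted by the value itself rather than by count
counts = toy["weather_main"].value_counts()
print(counts.sort_index())

print(toy.dtypes)          # object for strings, float64 for numbers
```

A column with few distinct values and `object` dtype is a typical candidate for a categorical feature.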


In [9]:
# Display number of not NaN's in every column (int)
# Q3.1 How many NA values are in the `clouds_all` column?
# Q3.2 How many NA values are in the `temp` column?
# Q3.3 How many NA values are in the `rain_1h` column?
# Q3.4 How many NA values are in the `snow_1h` column?
# Q3.5 How many explicit NA values are in the `traffic_volume` column?
non_na_counts = Floppa_Russkiy_kot.count()
print("Non-NA Values per Column:")
print(non_na_counts)
# Q3.1 - Q3.5
na_counts = {
    "clouds_all": Floppa_Russkiy_kot['clouds_all'].isna().sum(),
    "temp": Floppa_Russkiy_kot['temp'].isna().sum(),
    "rain_1h": Floppa_Russkiy_kot['rain_1h'].isna().sum(),
    "snow_1h": Floppa_Russkiy_kot['snow_1h'].isna().sum(),
    "traffic_volume": Floppa_Russkiy_kot['traffic_volume'].isna().sum()
}
for column, count in na_counts.items():
    print(f"NA Values in {column}: {count}")
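# Note: an equality comparison with pd.NA does not flag missing values, so this count is 0;
# the .isna() counts above are the reliable ones.
print("Explicit NA Values in traffic_volume:", (Floppa_Russkiy_kot['traffic_volume'] == pd.NA).sum())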


Non-NA Values per Column:
holiday                48204
temp                   48203
rain_1h                48203
snow_1h                48204
clouds_all             48201
weather_main           48203
weather_description    48201
date_time              48204
traffic_volume         48199
dtype: int64
NA Values in clouds_all: 3
NA Values in temp: 1
NA Values in rain_1h: 1
NA Values in snow_1h: 0
NA Values in traffic_volume: 5
Explicit NA Values in traffic_volume: 0
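
A short sketch of why an equality check is not a reliable way to count missing values, using `np.nan` on a made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([5545.0, np.nan, 4767.0, np.nan])

# NaN is not equal to anything, including itself, so this mask is all False
n_equal = (s == np.nan).sum()

# isna() tests for missingness directly
n_missing = s.isna().sum()

print(n_equal, n_missing)
```

This is why the `.isna().sum()` counts are the ones to report for the NA questions.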


In [10]:
data2 = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': ['text', np.nan, pd.NaT, pd.NA],
    'C': [10, 20, 30, np.nan]
})
nan_mask = data2['B'].isna()
if nan_mask.any():
    nan_indices = data2[nan_mask].index.tolist()
    nan_values = data2[nan_mask]['B'].tolist()
    for idx, value in zip(nan_indices, nan_values):
        print(f"Index: {idx}, Value: {value}")

Index: 1, Value: nan
Index: 2, Value: NaT
Index: 3, Value: <NA>


In [11]:
data2['B'].isnull().sum()

3

In [12]:
# Now drop rows with NaN with `.dropna`. Remember to either reassign your dataframe or provide `inplace=True` argument.
# The 'none' string is not considered NaN and should not be dropped 
Gospodi_dai_mne_sil = Floppa_Russkiy_kot.dropna()
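
Adding columns to a frame obtained from another frame can trigger pandas' `SettingWithCopyWarning`. One common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` right after filtering. The names `raw` and `clean` are hypothetical.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "temp": [288.28, np.nan, 291.72],
    "holiday": ["none", "none", "none"],
})

# .copy() makes the cleaned frame an independent object,
# so later column assignments clearly target `clean` and not `raw`
clean = raw.dropna().copy()
clean["temp_in_celsius"] = clean["temp"] - 273.15

print(clean)
print(raw.columns.tolist())  # the original frame is untouched
```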

In [13]:
# Display basic data statistics using .describe()
Gospodi_dai_mne_sil.describe()

Unnamed: 0,temp,rain_1h,snow_1h,clouds_all,traffic_volume
count,48190.0,48190.0,48190.0,48190.0,48190.0
mean,281.201366,0.334356,0.000222,49.369267,3259.859079
std,13.337406,44.795638,0.008169,39.016127,1986.972809
min,0.0,0.0,0.0,0.0,0.0
25%,272.16,0.0,0.0,1.0,1192.25
50%,282.44,0.0,0.0,64.0,3380.0
75%,291.8,0.0,0.0,90.0,4933.0
max,310.07,9831.3,0.51,100.0,7280.0


In [14]:
Gospodi_dai_mne_sil.count()

holiday                48190
temp                   48190
rain_1h                48190
snow_1h                48190
clouds_all             48190
weather_main           48190
weather_description    48190
date_time              48190
traffic_volume         48190
dtype: int64

In [15]:
# Count number of unique values in every column (int)

# Q4.1 How many unique values are in the `clouds_all` column?
print("Q4.1: ", Gospodi_dai_mne_sil['clouds_all'].nunique())
# Q4.2 How many unique values are in the `weather_main` column?
print("Q4.2: ", Gospodi_dai_mne_sil['weather_main'].nunique())

Q4.1:  60
Q4.2:  11


In [16]:
def count_unique_values(column_name, data):
    unique_values, counts = np.unique(data[column_name], return_counts=True)
    result = pd.DataFrame({column_name: unique_values, 'occurrences': counts})
    return result.sort_values(by='occurrences')

# Q5.1 For every unique `weather_main` value give its number of occurrences
print("\nQ5.1 For every unique `weather_main` value give its number of occurrences\n")
print(count_unique_values('weather_main', Gospodi_dai_mne_sil))

# Q5.2 For every unique `weather_description` value give its number of occurrences
print("\nQ5.2 For every unique `weather_description` value give its number of occurrences\n")
print(count_unique_values('weather_description', Gospodi_dai_mne_sil))



Q5.1 For every unique `weather_main` value give its number of occurrences

    weather_main  occurrences
9         Squall            4
7          Smoke           20
3            Fog          912
10  Thunderstorm         1034
4           Haze         1359
2        Drizzle         1821
8           Snow         2876
6           Rain         5671
5           Mist         5950
0          Clear        13384
1         Clouds        15159

Q5.2 For every unique `weather_description` value give its number of occurrences

                    weather_description  occurrences
26                          shower snow            1
32            thunderstorm with drizzle            2
6                         freezing rain            2
28                                sleet            3
0                               SQUALLS            4
25                       shower drizzle            6
14                  light rain and snow            6
15                    light shower snow           11
22  

In [17]:
def display_column_stats(col_name, data, rnd=3):
    column = data[col_name]
    stats = {
        'max': column.max(),
        'min': column.min(),
        'mean': column.mean(),
        'std': column.std()
    }
    rounded_stats = {k: round(v, rnd) for k, v in stats.items()}
    print(f"Statistics for '{col_name}':")
    for stat, value in rounded_stats.items():
        print(f"  {stat.capitalize()}: {value}")

# Q6.2 What are the max, min, mean and the std of the `clouds_all` column?
print('#Q6.2')
display_column_stats('clouds_all', Gospodi_dai_mne_sil)

# Q6.4 What are the max, min, mean and the std of the `rain_1h` column?
print('#Q6.4')
display_column_stats('rain_1h', Gospodi_dai_mne_sil)

#Q6.2
Statistics for 'clouds_all':
  Max: 100.0
  Min: 0.0
  Mean: 49.369
  Std: 39.016
#Q6.4
Statistics for 'rain_1h':
  Max: 9831.3
  Min: 0.0
  Mean: 0.334
  Std: 44.796


In [18]:
# Display data types of all columns (int)
Gospodi_dai_mne_sil.dtypes

holiday                 object
temp                   float64
rain_1h                float64
snow_1h                float64
clouds_all             float64
weather_main            object
weather_description     object
date_time               object
traffic_volume         float64
dtype: object

In [19]:
def count_columns_by_dtype(data, dtype):
    return data.select_dtypes(include=[dtype]).columns.size

# Q7.1 How many columns have `object` data type?
print(f"#Q7.1: {count_columns_by_dtype(Gospodi_dai_mne_sil, 'object')}")

# Q7.2 How many columns have `int64` data type?
print(f"#Q7.2: {count_columns_by_dtype(Gospodi_dai_mne_sil, 'int64')}")


#Q7.1: 4
#Q7.2: 0


# 3. Data selection

In pandas.DataFrame you could select

  Row/s by position (an integer in [0 .. number of rows - 1]) with `.iloc`, or by `DataFrame.index` label with `.loc`:   

```
  data.loc[0]  
  data.loc[5:10]  
  data.iloc[0]  
  data.iloc[5:10]   
```

Though, this is probably the worst way to manipulate rows.   
  Columns by name

```
  data[columname]
```

  Row/s and columns

```
  data.loc[10, columname]  
  data.iloc[10, column_position]  
```

Using boolean mask

```
  mask = data[columname] > value  
  data[mask]  
```

You could combine multiple conditions using & or | (and, or)   

```
cond1 = data[columname1] > value1  
cond2 = data[columname2] > value2  
data[cond1 & cond2]  
```

Using queries .query():  

```
value = 5 
data.query("columname > @value")  
```

You could combine multiple conditions using and, or  

```
data.query("(columname1 > @value1) and (columname2 > @value2)")
```

and others. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html for more examples.

Remember that string values inside a query need a different kind of quotation mark (" or ') than the one delimiting the query itself.
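
The selection styles above can be compared on a small made-up frame. This sketch uses an index with gaps (as you get after `.dropna()`) to show that `.loc` and `.iloc` may return different rows for the same number. The names `toy` and `threshold` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"temp": [288.3, 265.0, 269.5, 291.7], "clouds_all": [40, 90, 90, 1]},
    index=[0, 2, 3, 5],  # gaps in the index, as after dropna()
)

by_label = toy.loc[2, "temp"]    # row whose index label is 2
by_position = toy.iloc[2, 0]     # third row, first column (counting from 0)

# Boolean mask: each condition in parentheses, combined with &
mask = (toy["temp"] < 270) & (toy["clouds_all"] == 90)

# The same selection with .query(); @ refers to a Python variable
threshold = 270
via_query = toy.query("temp < @threshold and clouds_all == 90")

print(by_label, by_position)
print(toy[mask].equals(via_query))
```

When a question asks about "the row with index N", it is worth checking whether a label (`.loc`) or a position (`.iloc`) is meant.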


In [20]:
print(Gospodi_dai_mne_sil.columns)


Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
       'weather_description', 'date_time', 'traffic_volume'],
      dtype='object')


In [21]:
def get_row_value(data, index, column):
    return data.iloc[index][column]

# Q8.2 What is the weather description of the time slot with index 999?
print(f"Q8.2: {get_row_value(Gospodi_dai_mne_sil, 999, 'weather_description')}")

# Q8.4 What is the weather main of the time slot with index 314?
print(f"Q8.4: {get_row_value(Gospodi_dai_mne_sil, 314, 'weather_main')}")

Q8.2: overcast clouds
Q8.4: Clouds


In [22]:
# Define a function to retrieve and print values from the dataframe
def get_value(df, index, column):
    value = df.loc[index][column]
    print(f"Index - {index}: The {column.replace('_', ' ').title()} on index {index} is {value}.")

# Q9.2 What is the weather description of the time slot on index 5695?
get_value(Gospodi_dai_mne_sil, 5695, 'weather_description')

# Q9.3 How much is cloud coverage on the index 1045?
get_value(Gospodi_dai_mne_sil, 1045, 'clouds_all')


Index - 5695: The Weather Description on index 5695 is overcast clouds.
Index - 1045: The Clouds All on index 1045 is 20.0.


In [23]:
# Using mask or .query syntax select rows/columns (int)

# Q10.1: How many time slots have less than 270 temperature?
mask = Gospodi_dai_mne_sil['temp'] < 270
nt = mask.sum()
print(f"# Q10.1: Number of time slots with temperature less than 270: {nt}")
# Q10.2: When was the first "light intensity drizzle" in weather description captured?
date_first = Gospodi_dai_mne_sil.query('weather_description == "light intensity drizzle"').head(1)
print(f"# Q10.2: First 'light intensity drizzle' captured at: {date_first['date_time'].values[0]}")



# Q10.1: Number of time slots with temperature less than 270: 9308
# Q10.2: First 'light intensity drizzle' captured at: 2012-10-10 07:00:00


In [24]:
# Q11.3: What was the cloud coverage percentage on October 16th 2012 at 19:00?
cc = Gospodi_dai_mne_sil.query('date_time == "2012-10-16 19:00:00"')['clouds_all'].values[0]
print(f"# Q11.3: Cloud coverage percentage on October 16th 2012 at 19:00: {cc}%")

# Q11.4: What is the traffic_volume of the thirty-fourth sample with clouds_all == 90?
tv = Gospodi_dai_mne_sil.query('clouds_all == 90').iloc[33]['traffic_volume']
print(f"# Q11.4: Traffic volume of the 34th sample with clouds_all == 90: {tv}")

# Q11.3: Cloud coverage percentage on October 16th 2012 at 19:00: 68.0%
# Q11.4: Traffic volume of the 34th sample with clouds_all == 90: 4329.0


In [25]:
# Q12.4: What is the temperature of the 666-th time slot with weather_description 'proximity thunderstorm'?
tep = Gospodi_dai_mne_sil.query('weather_description == "proximity thunderstorm"').iloc[665]['temp']
print(f"# Q12.4: Temperature at the 666-th time slot with weather_description 'proximity thunderstorm': {tep} K")

# Q12.5: What is the temperature of 1337-th time slot with clear sky (clouds_all <= 20)?
tes = Gospodi_dai_mne_sil.query('clouds_all <= 20').iloc[1336]['temp']
print(f"# Q12.5: Temperature at the 1337-th time slot with clear sky (clouds_all <= 20): {tes} K")

# Q12.4: Temperature at the 666-th time slot with weather_description 'proximity thunderstorm': 288.6 K
# Q12.5: Temperature at the 1337-th time slot with clear sky (clouds_all <= 20): 276.63 K


# 4. Creating new columns

Creating a new column in a pandas.DataFrame is as easy as:
```
data['new_awesome_column'] = None
```
that's it. But such a column is relatively useless. Typically, you would compute something new based on existing data and save it in a new column. For example one might want to sum a number of existing columns:
```
data['sum'] = data[col1] + data[col2] + ...
```
Pandas also provides other powerful tools: the .apply(), .map() and .applymap() methods (they are similar, but not quite the same, see https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas ). They allow you to apply a function to every value in the column/s (row-wise), to every row (column-wise) or to every cell (element-wise). For example, the same sum computed using .apply():
```
data['sum'] = data[[col1, col2, col3]].apply(sum, axis=1)
```
You are not restricted to existing functions; .apply() accepts any function (including lambda functions):
```
data['sum'] = data[[col1, col2, col3]].apply(lambda x: x.iloc[0] + x.iloc[1] + x.iloc[2], axis=1)
```
or an ordinary Python function (if it should have more complex behaviour):
```
def _sum(x):
    total = 0
    for elem in x:
        total += elem
    return total

data['sum'] = data[[col1, col2, col3]].apply(_sum, axis=1) 
```
Many pandas methods have an axis parameter: axis=0 refers to rows, axis=1 refers to columns (so `.apply(f, axis=1)` calls `f` once per row, combining values across columns).

Warning: you should never use for loops to sum the numerical elements of a container.
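
For plain sums, a built-in vectorized method is usually preferable to `.apply()`. A minimal sketch on made-up data (the `toy` frame and its column names are assumptions) comparing the two:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Vectorized: pandas sums across the columns of each row
toy["sum_vectorized"] = toy[["a", "b", "c"]].sum(axis=1)

# Same result with .apply(): the lambda receives one row at a time
toy["sum_apply"] = toy[["a", "b", "c"]].apply(
    lambda row: row["a"] + row["b"] + row["c"], axis=1
)

print(toy)
```

Both columns hold the same values; the vectorized version avoids calling a Python function for every row.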

In [26]:
# Create new columns using the old ones (new column in your DataFrame)

# Q13.1 Create a `temp_in_celcius` column from the existing `temp` (kelvin) using any method above
Gospodi_dai_mne_sil['temp_in_celcius'] = Gospodi_dai_mne_sil['temp'] - 273.15
# Q13.2 Create a new bool column `hot` which indicates whether the time slot was hot (temp > 300)
Gospodi_dai_mne_sil['hot'] = (Gospodi_dai_mne_sil['temp'] > 300)
# Q13.3 Create a new bool column `rainy_and_cloudy` which indicates whether it was rainy (>0.1) AND cloudy (>50)
Gospodi_dai_mne_sil['rainy_and_cloudy'] = ((Gospodi_dai_mne_sil['rain_1h'] > 0.1) & (Gospodi_dai_mne_sil['clouds_all'] > 50))
# Q13.4 Create a new bool column `is_holiday` which indicates whether the day of the time slot falls on any holiday
Gospodi_dai_mne_sil['is_holiday'] = (Gospodi_dai_mne_sil['holiday'] != 'none')
# Q13.5 Create a new column `traffic_cat` by splitting a `traffic_volume` into 5 ([1..5]) distinct intervals: 0 <= x <=20%,
# 20% < x <= 40%, ... 80% < x <= 100% percentiles. You could use `.quantile()` to compute percentiles.

percentiles = Gospodi_dai_mne_sil['traffic_volume'].quantile([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).values
Gospodi_dai_mne_sil['traffic_cat'] = Gospodi_dai_mne_sil['traffic_volume'].apply(lambda x: 
                                                                 1 if x <= percentiles[1] else 
                                                                 2 if x <= percentiles[2] else 
                                                                 3 if x <= percentiles[3] else 
                                                                 4 if x <= percentiles[4] else 
                                                                 5)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Gospodi_dai_mne_sil['temp_in_celcius'] = Gospodi_dai_mne_sil['temp'] - 273.15
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Gospodi_dai_mne_sil['hot'] = (Gospodi_dai_mne_sil['temp'] > 300)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Gospodi_dai_mne_sil['rainy_and_cloudy'] = ((Gospodi_dai_mne_si
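
An alternative to a nested conditional for building quantile-based categories is `pd.cut` with the quantile values as bin edges. This is a sketch on made-up data, not the required solution; `volume`, `edges` and `cat` are hypothetical names. It assumes all edges are distinct (with repeated edges `pd.cut` raises an error unless `duplicates="drop"` is passed, which also changes the number of bins).

```python
import pandas as pd

volume = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

# 0%, 20%, ..., 100% percentiles serve as the bin edges
edges = volume.quantile([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).values

# Intervals are right-closed; include_lowest=True keeps the minimum in bin 1
cat = pd.cut(volume, bins=edges, labels=[1, 2, 3, 4, 5], include_lowest=True)

print(cat.astype(int).tolist())
```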

In [27]:
# Create a new column 'date_time_conv' with converted datetime format
Gospodi_dai_mne_sil['date_time_conv'] = pd.to_datetime(Gospodi_dai_mne_sil['date_time'], errors='coerce')

# Create a new column 'is_autumn' to check if date time series falls in autumn
Gospodi_dai_mne_sil['is_autumn'] = (Gospodi_dai_mne_sil['date_time_conv'].dt.month >= 9) & (Gospodi_dai_mne_sil['date_time_conv'].dt.month <= 11)

# Create a new column 'is_march_8th' to check if date time series falls on March 8th
Gospodi_dai_mne_sil['is_march_8th'] = (Gospodi_dai_mne_sil['date_time_conv'].dt.month == 3) & (Gospodi_dai_mne_sil['date_time_conv'].dt.day == 8)

# Q14.1: How many cloudy time slots were captured in autumn 2016? Including both start and end day.
autumn_2016_cloudy = Gospodi_dai_mne_sil[(Gospodi_dai_mne_sil['is_autumn']) &
                                      (Gospodi_dai_mne_sil['date_time_conv'].dt.year == 2016) &
                                      (Gospodi_dai_mne_sil['clouds_all'] > 0)]
print(f"# Q14.1:  {len(autumn_2016_cloudy)}")

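# Q14.4: What is the minimum traffic volume of time slots captured on March 8th (all years), that was warmer than 290?
# Note: only the March 8th mask is applied below; the `temp > 290` condition from the question is not included.
min_traffic_volume = Gospodi_dai_mne_sil[Gospodi_dai_mne_sil['is_march_8th']]['traffic_volume'].min()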
print(f"# Q14.4:  {int(min_traffic_volume)}")

# Q14.1:  2101
# Q14.4:  248


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Gospodi_dai_mne_sil['date_time_conv'] = pd.to_datetime(Gospodi_dai_mne_sil['date_time'], errors='coerce')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Gospodi_dai_mne_sil['is_autumn'] = (Gospodi_dai_mne_sil['date_time_conv'].dt.month >= 9) & (Gospodi_dai_mne_sil['date_time_conv'].dt.month <= 11)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/

In [28]:
# Using mask or .query syntax select rows/columns and compute simple statistics (float)
# Q15.1 What was the average temperature of time slots with main weather "Haze"?
print("#Q15.1: ", round(Gospodi_dai_mne_sil[Gospodi_dai_mne_sil['weather_main'] == 'Haze']['temp'].mean(), 3))
# Q15.4: What is the median of temperatures captured in April 2017?
april_2017 = Gospodi_dai_mne_sil[(Gospodi_dai_mne_sil['date_time_conv'].dt.month == 4) &
                              (Gospodi_dai_mne_sil['date_time_conv'].dt.year == 2017)]
temp_april_2017 = april_2017['temp'].median()
print(f"#Q15.4: {temp_april_2017:.3f}")

#Q15.1:  275.805
#Q15.4: 282.010


In [29]:
# Define a function to calculate and print the average value
def calculate_average(df, condition, column):
    avg_value = df.query(condition)[column].mean()
    # Build the label outside the f-string: backslashes inside f-string expressions fail before Python 3.12
    label = condition.replace('==', '_').replace('!=', '_not_').replace(' ', '_').replace("'", '')
    print(f"{label}: The average {column.replace('_', ' ').title()} is {round(avg_value, 3)}.")

# Q16.1 What is the average temperature in Celsius of the time slots with rainy_and_cloudy=True?
calculate_average(Gospodi_dai_mne_sil, "rainy_and_cloudy == True", 'temp_in_celcius')

# Q16.2 What is the average traffic volume on holidays?
calculate_average(Gospodi_dai_mne_sil, "holiday != 'none'", 'traffic_volume')

# Q16.3 What is the average traffic volume on non-holidays?
calculate_average(Gospodi_dai_mne_sil, "holiday == 'none'", 'traffic_volume')

# Q16.4 What is the average traffic volume in the highest quantile?
calculate_average(Gospodi_dai_mne_sil, "traffic_cat == 5", 'traffic_volume')

# Q16.5 What is the average traffic volume in the lowest quantile?
calculate_average(Gospodi_dai_mne_sil, "traffic_cat == 1", 'traffic_volume')

rainy_and_cloudy___True: The average Temp In Celcius is 13.585.
holiday__not__none: The average Traffic Volume is 865.443.
holiday___none: The average Traffic Volume is 3262.894.
traffic_cat___5: The average Traffic Volume is 5870.913.
traffic_cat___1: The average Traffic Volume is 485.554.


# 5. Basic date processing

You have figured out that the date column is too harsh for you, so you decide to convert it to a more convenient format:

- Use the pandas function to_datetime() to convert the date to a proper datetime format.
- Extract year, month, day and weekday from your new date column. Save them to separate columns.
- How many columns does your data have now?
- Drop the date column; remember to set the inplace parameter to True.

Hint: for a datetime-formatted date you could extract the year as follows:
```
data.date.dt.year
```
Very often a date can be a ridiculously rich feature: sometimes it is holidays that matter, sometimes weekends, sometimes special days like Black Friday.

Learn how to work with dates in Python!
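
The conversion and extraction steps can be sketched on two made-up timestamps. The `toy` frame is an assumption for illustration; the `.dt` accessors are the same ones you would use on the real column.

```python
import pandas as pd

toy = pd.DataFrame({"date_time": ["2012-10-02 09:00:00", "2016-03-08 17:00:00"]})

# Strings become datetime64 values, which unlocks the .dt accessor
toy["date_time"] = pd.to_datetime(toy["date_time"])

toy["year"] = toy["date_time"].dt.year
toy["month"] = toy["date_time"].dt.month
toy["day"] = toy["date_time"].dt.day
toy["weekday"] = toy["date_time"].dt.weekday  # Monday is 0, Sunday is 6
toy["hour"] = toy["date_time"].dt.hour

print(toy)
```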


In [30]:
# Create new columns based on the `date_time` column
import warnings
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)  # this code raised the warning a great many times
# Q17.1 Convert date to datetime format
Gospodi_dai_mne_sil['date_time'] = pd.to_datetime(Gospodi_dai_mne_sil['date_time'])

# Q17.2 Extract and store `year`
Gospodi_dai_mne_sil['year'] = Gospodi_dai_mne_sil['date_time'].dt.year

# Q17.3 Extract and store `month`
Gospodi_dai_mne_sil['month'] = Gospodi_dai_mne_sil['date_time'].dt.month

# Q17.4 Extract and store `day`
Gospodi_dai_mne_sil['day'] = Gospodi_dai_mne_sil['date_time'].dt.day

# Q17.5 Extract and store `weekday` (Monday - 0, Sunday - 6)
Gospodi_dai_mne_sil['weekday'] = Gospodi_dai_mne_sil['date_time'].dt.weekday

# Q17.6 Extract and store `hour`
Gospodi_dai_mne_sil['hour'] = Gospodi_dai_mne_sil['date_time'].dt.hour


In [31]:
#TEST
Gospodi_dai_mne_sil.head(10)

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume,temp_in_celcius,...,is_holiday,traffic_cat,date_time_conv,is_autumn,is_march_8th,year,month,day,weekday,hour
0,none,288.28,0.0,0.0,40.0,Clouds,scattered clouds,2012-10-02 09:00:00,5545.0,15.13,...,False,5,2012-10-02 09:00:00,True,False,2012,10,2,1,9
1,none,289.36,0.0,0.0,75.0,Clouds,broken clouds,2012-10-02 10:00:00,4516.0,16.21,...,False,4,2012-10-02 10:00:00,True,False,2012,10,2,1,10
2,none,289.58,0.0,0.0,90.0,Clouds,overcast clouds,2012-10-02 11:00:00,4767.0,16.43,...,False,4,2012-10-02 11:00:00,True,False,2012,10,2,1,11
3,none,290.13,0.0,0.0,90.0,Clouds,overcast clouds,2012-10-02 12:00:00,5026.0,16.98,...,False,4,2012-10-02 12:00:00,True,False,2012,10,2,1,12
4,none,291.14,0.0,0.0,75.0,Clouds,broken clouds,2012-10-02 13:00:00,4918.0,17.99,...,False,4,2012-10-02 13:00:00,True,False,2012,10,2,1,13
5,none,291.72,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 14:00:00,5181.0,18.57,...,False,4,2012-10-02 14:00:00,True,False,2012,10,2,1,14
6,none,293.17,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 15:00:00,5584.0,20.02,...,False,5,2012-10-02 15:00:00,True,False,2012,10,2,1,15
7,none,293.86,0.0,0.0,1.0,Clear,sky is clear,2012-10-02 16:00:00,6015.0,20.71,...,False,5,2012-10-02 16:00:00,True,False,2012,10,2,1,16
8,none,294.14,0.0,0.0,20.0,Clouds,few clouds,2012-10-02 17:00:00,5791.0,20.99,...,False,5,2012-10-02 17:00:00,True,False,2012,10,2,1,17
9,none,293.1,0.0,0.0,20.0,Clouds,few clouds,2012-10-02 18:00:00,4770.0,19.95,...,False,4,2012-10-02 18:00:00,True,False,2012,10,2,1,18


In [32]:
# Q18.4 What is the average traffic volume in the time period between 15-19 hours
print("#Q18.4: ", int(Gospodi_dai_mne_sil[(15 <= Gospodi_dai_mne_sil['hour']) & (Gospodi_dai_mne_sil['hour'] < 19)]['traffic_volume'].mean()))

# Q18.5 What is the average traffic volume on World Bicycle Day (June 3)?
def is_day(date, month, day):
    return (date.dt.month == month) & (date.dt.day == day)

print("#Q18.5: ", int(Gospodi_dai_mne_sil[is_day(Gospodi_dai_mne_sil['date_time'], 6, 3)]['traffic_volume'].mean()))


#Q18.4:  5117
#Q18.5:  3445


In [33]:
print(Gospodi_dai_mne_sil.columns)

Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
       'weather_description', 'date_time', 'traffic_volume', 'temp_in_celcius',
       'hot', 'rainy_and_cloudy', 'is_holiday', 'traffic_cat',
       'date_time_conv', 'is_autumn', 'is_march_8th', 'year', 'month', 'day',
       'weekday', 'hour'],
      dtype='object')


# 6. Groupby

from the documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.

`.groupby()` is one of the most powerful tools for feature engineering. It is very often used to group objects with the same categorical characteristic and compute some statistic (e.g. mean, max, etc.) of one of their numerical characteristics.

Instead of computing the average traffic volume separately for each month, you could compute the averages for every month in a single command:
```
data.groupby('month')['traffic_volume'].mean()
```
You could also make multi-column groups:
```
data.groupby(['weekday','month'])['traffic_volume'].min()
```
Next, you could compute multiple aggregation functions:
```
data.groupby(['weekday','month'])['traffic_volume'].agg(['min', 'max'])
```
Instead of using built-in functions, you could compute custom functions using apply:
```
import numpy as np
data.groupby(['weekday','month'])['traffic_volume'].apply(lambda x: np.quantile(x, .5))
```
And the coolest thing is that you can map the results of groupby back onto your DataFrame!
```
gp = data.groupby(['month'])['traffic_volume'].median()
data['gp_feature'] = data['month'].map(gp)
```
Now, if some timeslot has `month == 2`, its `gp_feature` will be equal to the median traffic volume among all observations in February.

Read more examples in the documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
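
To make the split-apply-combine steps concrete, here is a minimal sketch on a five-row toy frame. The values are invented for illustration only; the column names mirror the ones used in the snippets above.

```python
import pandas as pd

# Toy data (hypothetical values): two observations in January, three in February.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "traffic_volume": [100, 300, 10, 20, 60],
})

# Split by month, apply the median to each group, combine into a Series indexed by month.
monthly_median = toy.groupby("month")["traffic_volume"].median()

# Several aggregations at once; string names avoid passing Python built-ins.
monthly_range = toy.groupby("month")["traffic_volume"].agg(["min", "max"])

# Map the per-group statistic back onto every row of the original frame.
toy["gp_feature"] = toy["month"].map(monthly_median)

print(monthly_median)
print(monthly_range)
print(toy)
```

Every row now carries the median of its own month, so the mapped column can be used as an ordinary numerical feature.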


In [34]:
# Create some groupby features
# Q19.1 `traffic_by_year` groupby `year` and compute median traffic volume.
print("\nQ19.1:\n", Gospodi_dai_mne_sil.groupby('year')['traffic_volume'].median())
# Q19.2 `traffic_by_weekday` groupby `weekday` and compute median traffic volume.
print("\nQ19.2:\n", Gospodi_dai_mne_sil.groupby('weekday')['traffic_volume'].median())
# Q19.3 `temperature_by_traffic` groupby `traffic_cat` and compute average temperature in celsius.
print("\nQ19.3:\n", Gospodi_dai_mne_sil.groupby('traffic_cat')['temp_in_celcius'].mean())


Q19.1:
 year
2012    3225.0
2013    3344.0
2014    3316.0
2015    3368.0
2016    3258.5
2017    3530.0
2018    3400.0
Name: traffic_volume, dtype: float64

Q19.2:
 weekday
0    3619.0
1    4070.0
2    4315.0
3    4280.0
4    4336.5
5    3003.0
6    2260.0
Name: traffic_volume, dtype: float64

Q19.3:
 traffic_cat
1    5.445774
2    6.031004
3    9.244710
4    9.797489
5    9.740191
Name: temp_in_celcius, dtype: float64


# 7. Building a regression model

- Tree models do not require normalized data, but for linear and k-NN models normalization is essential.
- Remember, that not all of the features in the table are numeric, some of them might be viewed as categorical.
- You may create or drop any features you want - try to only keep features which you think will be relevant to the prediction of traffic volume.
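
The first two points can be sketched as follows. This is only one common option, not necessarily the intended solution: it assumes a toy frame with one numerical column and one categorical column, scales the former and one-hot encodes the latter. The frame `toy` and the name `toy_preprocessor` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) with one numerical and one categorical feature.
toy = pd.DataFrame({
    "temp": [280.0, 290.0, 300.0],
    "weather_main": ["Clear", "Clouds", "Clear"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical features: zero mean, unit variance (matters for linear and k-NN models).
        ("num", StandardScaler(), ["temp"]),
        # Categorical features: one binary column per category.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather_main"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = np.asarray(toy_preprocessor.fit_transform(toy))
print(features)
```

Columns that are not listed in any transformer are dropped by default, so every feature you want to keep has to appear in one of the lists.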



In [35]:
# Q20 Separate your data into inputs and targets, keeping only relevant inputs. Drop any features computed from the output, e.g. `traffic_cat`
# Never thought this would be so convenient: https://stackoverflow.com/questions/71219215/convert-year-value-to-a-periodic-value-in-pandas-dataframe
Y = Gospodi_dai_mne_sil["traffic_volume"]
Gospodi_dai_mne_sil['sin_hour'] = round(np.sin(2 * np.pi * (Gospodi_dai_mne_sil['hour']/24)), 3)
Gospodi_dai_mne_sil['cos_hour'] = round(np.cos(2 * np.pi * (Gospodi_dai_mne_sil['hour']/24)), 3)
good_columns = ['is_holiday','rain_1h', 'snow_1h', 'clouds_all', 'weather_description','temp','weekday', 'month', 'day', 'sin_hour', 'cos_hour'] #todo
Xdf = Gospodi_dai_mne_sil[good_columns]
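
A short sketch of why the `sin_hour` / `cos_hour` pair is used in the cell above: on the unit circle, 23:00 and 00:00 end up close to each other, while as raw integers they are 23 apart. The helper names `encode_hour` and `encoded_distance` are made up for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of the day (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(h1, h2):
    """Euclidean distance between two hours after the sin/cos encoding."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

near_midnight = encoded_distance(23, 0)     # neighbouring hours across midnight
noon_vs_midnight = encoded_distance(12, 0)  # opposite points of the daily cycle

print(near_midnight, noon_vs_midnight)
```

The same trick applies to any periodic quantity, for example `month` with a period of 12 or `weekday` with a period of 7.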

In [36]:
print(Gospodi_dai_mne_sil.columns)

Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
       'weather_description', 'date_time', 'traffic_volume', 'temp_in_celcius',
       'hot', 'rainy_and_cloudy', 'is_holiday', 'traffic_cat',
       'date_time_conv', 'is_autumn', 'is_march_8th', 'year', 'month', 'day',
       'weekday', 'hour', 'sin_hour', 'cos_hour'],
      dtype='object')


Now it's time to split our data into train and test sets. Generally a random split is used, but one needs to be very careful with time series data - we need to make sure train and test data don't contain mixed adjacent time slots. In general with time series, it is recommended not to predict values from the past using input information from the future (although the applicability of this rule in our case is debatable), so we'll use sklearn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) class here. TimeSeriesSplit splits data into a number of folds, then only provides data from past folds to train a model tested on the currently considered fold. So if we split our data into five parts, we'll get four folds:

1. Train on [0], test on [1]
2. Train on [0,1], test on [2]
3. Train on [0, 1, 2], test on [3]
4. Train on [0, 1, 2, 3], test on [4]

For the following tasks, you are required to use train and test indices from the last fold provided by TimeSeriesSplit with `n_splits` = 5. Note that `n_splits` = 5 divides the data into six consecutive chunks and yields five folds.
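
The folding scheme can be checked on a tiny example. This is a minimal sketch on twelve made-up samples; the names `X_toy`, `tscv_toy` and `folds` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve toy samples, assumed to be already sorted by time.
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=5)
folds = list(tscv_toy.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()}, test={test_idx.tolist()}")

# The assignment asks for the indices of the last fold.
last_train_idx, last_test_idx = folds[-1]
```

In every fold the training indices come strictly before the test indices, so the model never sees the future of the period it is tested on.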

In [37]:
# Q21 Split your data into train and test parts.
# How many records (rows) do you have in train and test tables? (list of int)?
# Use sklearn.model_selection.TimeSeriesSplit with n_splits=5

from sklearn.model_selection import TimeSeriesSplit
# Initialize TimeSeriesSplit with 5 folds
tscv = TimeSeriesSplit(n_splits=5)
# Initialize lists to store record counts
train_test_records_counts = []
train_records_counts = []
test_records_counts = []
# Split data into train and test sets
for train_indices, test_indices in tscv.split(Xdf):
    X_train, X_test = Xdf.iloc[train_indices], Xdf.iloc[test_indices]
    Y_train, Y_test = Y.iloc[train_indices], Y.iloc[test_indices]
    # Store record counts
    train_test_records_counts.append((len(X_train), len(X_test)))
    train_records_counts.append(len(X_train))
    test_records_counts.append(len(X_test))
# Print record counts for each split
for i, (train_count, test_count) in enumerate(train_test_records_counts):
    print(f'Split {i + 1}: Train records count = {train_count}, Test records count = {test_count}')


Split 1: Train records count = 8035, Test records count = 8031
Split 2: Train records count = 16066, Test records count = 8031
Split 3: Train records count = 24097, Test records count = 8031
Split 4: Train records count = 32128, Test records count = 8031
Split 5: Train records count = 40159, Test records count = 8031


In [38]:
#Test
X_train

Unnamed: 0,is_holiday,rain_1h,snow_1h,clouds_all,weather_description,temp,weekday,month,day,sin_hour,cos_hour
0,False,0.0,0.0,40.0,scattered clouds,288.28,1,10,2,0.707,-0.707
1,False,0.0,0.0,75.0,broken clouds,289.36,1,10,2,0.500,-0.866
2,False,0.0,0.0,90.0,overcast clouds,289.58,1,10,2,0.259,-0.966
3,False,0.0,0.0,90.0,overcast clouds,290.13,1,10,2,0.000,-1.000
4,False,0.0,0.0,75.0,broken clouds,291.14,1,10,2,-0.259,-0.966
...,...,...,...,...,...,...,...,...,...,...,...
40168,False,0.0,0.0,90.0,haze,260.84,3,12,28,-0.866,-0.500
40169,False,0.0,0.0,90.0,light snow,260.65,3,12,28,-0.966,-0.259
40170,False,0.0,0.0,90.0,haze,260.65,3,12,28,-0.966,-0.259
40171,False,0.0,0.0,90.0,light snow,260.76,3,12,28,-1.000,-0.000


In [39]:
X_test

Unnamed: 0,is_holiday,rain_1h,snow_1h,clouds_all,weather_description,temp,weekday,month,day,sin_hour,cos_hour
40173,False,0.0,0.0,90.0,light snow,260.29,3,12,28,-0.866,0.500
40174,False,0.0,0.0,90.0,light snow,260.46,3,12,28,-0.707,0.707
40175,False,0.0,0.0,90.0,light snow,260.00,3,12,28,-0.500,0.866
40176,False,0.0,0.0,90.0,mist,260.00,3,12,28,-0.500,0.866
40177,False,0.0,0.0,90.0,light snow,259.00,3,12,28,-0.259,0.966
...,...,...,...,...,...,...,...,...,...,...,...
48199,False,0.0,0.0,75.0,broken clouds,283.45,6,9,30,-0.966,0.259
48200,False,0.0,0.0,90.0,overcast clouds,282.76,6,9,30,-0.866,0.500
48201,False,0.0,0.0,90.0,proximity thunderstorm,282.73,6,9,30,-0.707,0.707
48202,False,0.0,0.0,90.0,overcast clouds,282.09,6,9,30,-0.500,0.866


In [40]:
Y_train

0        5545.0
1        4516.0
2        4767.0
3        5026.0
4        4918.0
          ...  
40168    5141.0
40169    4520.0
40170    4520.0
40171    3764.0
40172    2849.0
Name: traffic_volume, Length: 40159, dtype: float64

In [41]:
# Create a predictive regression model of a traffic volume.
# Q22.1 Use linear regression with l2 regularization (Ridge regression)
# Q22.2 Use decision tree regression
# Q22.3 Use k nearest neighbours regression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
ridge_model = Ridge() 
dt_model = DecisionTreeRegressor(random_state=42)
knn_model = KNeighborsRegressor()
#test
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40159 entries, 0 to 40172
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   is_holiday           40159 non-null  bool   
 1   rain_1h              40159 non-null  float64
 2   snow_1h              40159 non-null  float64
 3   clouds_all           40159 non-null  float64
 4   weather_description  40159 non-null  object 
 5   temp                 40159 non-null  float64
 6   weekday              40159 non-null  int32  
 7   month                40159 non-null  int32  
 8   day                  40159 non-null  int32  
 9   sin_hour             40159 non-null  float64
 10  cos_hour             40159 non-null  float64
dtypes: bool(1), float64(6), int32(3), object(1)
memory usage: 2.9+ MB


In [42]:
# I re-import the libraries in every cell because VS Code throws errors otherwise
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np
# The output is hard to read without this
import warnings
warnings.filterwarnings("ignore")
# Defines
dt_model = DecisionTreeRegressor()
rf_model = RandomForestRegressor() #it is for me
lr_model = Ridge()
knn_model = KNeighborsRegressor()

# grid parameter for the decision tree model
dt_param_grid = {
    'regressor__max_depth': [3, 5, 10],
    'regressor__min_samples_split': [2, 5],
    'regressor__min_samples_leaf': [1, 5]
}

# grid parameter for the random forest model
rf_param_grid = {
    'regressor__n_estimators': [10, 50],
    'regressor__max_depth': [3, 5, 10],
    'regressor__min_samples_split': [2, 5],
    'regressor__min_samples_leaf': [1, 5]
}

# grid parameter for the linear regression model
lr_param_grid = {
    'regressor__alpha': [0.1, 1, 10]
}

# grid parameter for the k nearest neighbours model
knn_param_grid = {
    'regressor__n_neighbors': [3, 5, 10]
}
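# Note: `is_holiday` is in neither list below, so the ColumnTransformer drops it (its default is remainder='drop').
categorical_features = ['weather_description']
numerical_features = ['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weekday', 'month', 'day', 'sin_hour', 'cos_hour']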
categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder())
])
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
# pipelines
dt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', dt_model)
])
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', rf_model)
])
lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', lr_model)
])
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', knn_model)
])
# RANDOM tree model
dt_random_search = RandomizedSearchCV(dt_pipeline, dt_param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1, n_iter=10)
dt_random_search.fit(X_train, Y_train)
# RANDOM forest model
rf_random_search = RandomizedSearchCV(rf_pipeline, rf_param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1, n_iter=10)
rf_random_search.fit(X_train, Y_train)
# RANDOM linear regression model
lr_random_search = RandomizedSearchCV(lr_pipeline, lr_param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1, n_iter=10)
lr_random_search.fit(X_train, Y_train)
# RANDOM k nearest neighbours model
knn_random_search = RandomizedSearchCV(knn_pipeline, knn_param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1, n_iter=10)
knn_random_search.fit(X_train, Y_train)
# Best models
dt_best_model = dt_random_search.best_estimator_
rf_best_model = rf_random_search.best_estimator_
lr_best_model = lr_random_search.best_estimator_
knn_best_model = knn_random_search.best_estimator_
# Another one boring checkup
print("Best Decision Tree Hyperparameters:", dt_random_search.best_params_)
print("Best Random Forest Hyperparameters:", rf_random_search.best_params_)
print("Best Linear Regression Hyperparameters:", lr_random_search.best_params_)
print("Best K Nearest Neighbours Hyperparameters:", knn_random_search.best_params_)
# It is better to place it here, for next step 
dt_y_pred = dt_best_model.predict(X_test)
rf_y_pred = rf_best_model.predict(X_test)
lr_y_pred = lr_best_model.predict(X_test)
knn_y_pred = knn_best_model.predict(X_test)
# This block is only for my own check that the code works
rf_mse = mean_squared_error(Y_test, rf_y_pred)
rf_r2 = r2_score(Y_test, rf_y_pred)
print("Random Forest MSE:", rf_mse)
print("Random Forest R-squared:", rf_r2)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best Decision Tree Hyperparameters: {'regressor__min_samples_split': 2, 'regressor__min_samples_leaf': 5, 'regressor__max_depth': 3}
Best Random Forest Hyperparameters: {'regressor__n_estimators': 10, 'regressor__min_samples_split': 5, 'regressor__min_samples_leaf': 1, 'regressor__max_depth': 10}
Best Linear Regression Hyperparameters: {'regressor__alpha': 0.1}
Best K Nearest Neighbours Hyperparameters: {'regressor__n_neighbors': 3}
Random Forest MSE: 253408.43858955184
Random Forest R-squared: 0.9346692368028776


In [43]:
# Compute train and test mean squared error for your best models (list of float).
# Q24.1 Train, test MSE using linear regression with l2 regularization
# Q24.2 Train, test MSE using decision tree regression
# Q24.3 Train, test MSE using k nearest neighbours regression

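# Note: only test-set metrics are computed here and in the R^2 cell below; train-set values are not reported.
lr_mse = mean_squared_error(Y_test, lr_y_pred)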
dt_mse = mean_squared_error(Y_test, dt_y_pred)
knn_mse = mean_squared_error(Y_test, knn_y_pred)
print("#Q24.1: Linear Regression MSE: {:.3f}".format(lr_mse))
print("#Q24.2: Decision Tree MSE: {:.3f}".format(dt_mse))
print("#Q24.3: K Nearest Neighbours MSE: {:.3f}".format(knn_mse))


#Q24.1: Linear Regression MSE: 1345768.238
#Q24.2: Decision Tree MSE: 534900.668
#Q24.3: K Nearest Neighbours MSE: 530974.165


In [44]:
# Compute train and test R^2 for your best models (list of float).

# Q25.1 Train, test R^2 using linear regression with l2 regularization
# Q25.2 Train, test R^2 using decision tree regression
# Q25.3 Train, test R^2 using k nearest neighbours regression
lr_r2 = r2_score(Y_test, lr_y_pred)
dt_r2 = r2_score(Y_test, dt_y_pred)
knn_r2 = r2_score(Y_test, knn_y_pred)
print("#Q25.1: Linear Regression R-squared: {:.3f}".format(lr_r2))
print("#Q25.2: Decision Tree R-squared: {:.3f}".format(dt_r2))
print("#Q25.3: K Nearest Neighbours R-squared: {:.3f}".format(knn_r2))

#Q25.1: Linear Regression R-squared: 0.653
#Q25.2: Decision Tree R-squared: 0.862
#Q25.3: K Nearest Neighbours R-squared: 0.863
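
Questions 24 and 25 ask for both train and test values, while the two cells above report the test set only. Below is a minimal sketch of how both could be collected in one place. The helper `train_test_scores` and the toy data are assumptions made for this illustration, not part of the assignment template.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

def train_test_scores(model, X_tr, y_tr, X_te, y_te):
    """Return [train, test] lists for MSE and R^2, rounded to 3 decimals."""
    pred_tr = model.predict(X_tr)
    pred_te = model.predict(X_te)
    return {
        "mse": [round(float(mean_squared_error(y_tr, pred_tr)), 3),
                round(float(mean_squared_error(y_te, pred_te)), 3)],
        "r2": [round(float(r2_score(y_tr, pred_tr)), 3),
               round(float(r2_score(y_te, pred_te)), 3)],
    }

# Toy regression problem (hypothetical): linear signal plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Chronological-style split: first 80 rows for training, last 20 for testing.
X_tr, X_te = X_toy[:80], X_toy[80:]
y_tr, y_te = y_toy[:80], y_toy[80:]

toy_model = Ridge(alpha=1.0).fit(X_tr, y_tr)
scores = train_test_scores(toy_model, X_tr, y_tr, X_te, y_te)
print(scores)
```

In this notebook the same helper could be called as `train_test_scores(lr_best_model, X_train, Y_train, X_test, Y_test)`, and likewise for the decision tree and k-NN models.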


In [45]:
# Q26 Which features have largest (by absolute value) weight in your linear model (top 5 features)? (list of str).
# I hope that I understand it correctly
import warnings
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import Ridge
def encode_categorical_variables(X_train, X_test, columns):
    for column in columns:
        if column in X_train.columns and column in X_test.columns:
            le = LabelEncoder()  # well, sort of
            X_train[column] = le.fit_transform(X_train[column])
            X_test[column] = le.transform(X_test[column])
        else:
            print(f"Column {column} not found in X_train or X_test")
def get_top_features(X_train, weights, n):
    top_features = np.argsort(np.abs(weights))[-n:]
    return X_train.columns[top_features]
encode_categorical_variables(X_train, X_test, ['weather_description'])
# Note: the features are not scaled here, so the weight magnitudes depend on each feature's units.
ridge_model = Ridge()
ridge_model.fit(X_train, Y_train)
weights = ridge_model.coef_
top_features = get_top_features(X_train, weights, 5)
print("#Q26: Top 5 features with largest weights:", top_features)
Y_pred_train_ridge = ridge_model.predict(X_train)
Y_pred_ridge = ridge_model.predict(X_test)
# I don't know what is meant by this task

#Q26: Top 5 features with largest weights: Index(['weekday', 'is_holiday', 'sin_hour', 'snow_1h', 'cos_hour'], dtype='object')
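
Weights of a linear model are only comparable across features when the features share a scale. The sketch below shows this on made-up data: the ranking by absolute weight flips once a `StandardScaler` is placed in front of the Ridge model. The toy frame, the target, and the helper `rank_by_abs_weight` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): feature "b" lives on a scale about 1000 times larger than "a".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(0, 1, 200),
    "b": rng.normal(0, 1000, 200),
})
target = 1.0 * toy["a"] + 0.01 * toy["b"]

def rank_by_abs_weight(names, coefs):
    """Feature names sorted from largest to smallest absolute weight."""
    order = np.argsort(np.abs(coefs))[::-1]
    return [names[i] for i in order]

# Without scaling, the weight of "b" looks tiny because its values are huge.
raw_model = Ridge(alpha=1.0).fit(toy, target)
raw_ranking = rank_by_abs_weight(list(toy.columns), raw_model.coef_)

# With scaling, weights are expressed per standard deviation and become comparable.
scaled_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
]).fit(toy, target)
scaled_ranking = rank_by_abs_weight(
    list(toy.columns), scaled_model.named_steps["regressor"].coef_
)

print("raw:", raw_ranking)
print("scaled:", scaled_ranking)
```

For a fitted pipeline such as `lr_best_model`, the coefficients can be read the same way through `named_steps['regressor'].coef_`; they follow the column order produced by the preprocessor, not the order of the original frame.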


# Make sure your .ipynb is linearly executable     
# Kernel -> Restart & Run All -> No ERROR cells

In [46]:
from PIL import Image, ImageSequence
import imageio
import numpy as np
from IPython.display import Image as display_image
with Image.open('win.gif') as im:
    frames = [frame.copy() for frame in ImageSequence.Iterator(im)]
frames = [np.array(frame.convert('RGBA')) for frame in frames]
imageio.mimsave('high_quality.gif', frames, 'GIF', fps=20)  
from IPython.display import HTML
HTML('<img src="high_quality.gif" alt="high_quality" loop="infinity">')  