## Assignment 1: Python Basics, Built-in Structures, Functions, Files, Numpy and Pandas

This assignment contains 3 questions with details as below. 
The due date is March 2 (Wednesday), 2022, at 11:59 PM. Each late day will result in a 20% loss of total points. Submit a `.zip` file with this solved notebook and all the files generated during the resolution of this assignment.

## Question 1 (30 points)🏅 Olympic Winter Games

Now that you have understood the basics of loading data from a CSV, let's work with a real dataset from [Kaggle](https://www.kaggle.com/the-guardian/olympic-games). You can download the two datasets from there:

- `dictionary.csv`
- `winter.csv`

Go ahead and open those two files in your text editor to understand what they contain. The goal of this challenge is to implement three functions that answer the following questions:

1. Who won the most winter Olympic games medals (gold/silver/bronze) ever? 
2. From `min_year` to `max_year` which country won the most gold medals?
3. Find the three women with the most 5000 meters medals (gold/silver/bronze).


⚠️ For this challenge, <strong>you _can't_ use `pandas` yet</strong> 😉. Let's see how far you can go with just Python & the [`csv` module](https://docs.python.org/3/library/csv.html). The linked documentation has examples of how to open and read a CSV file. Do not forget to `import csv`!
 ℹ️ Open the files in read mode. Each row maps naturally to a dict keyed by the column names, so see the class `csv.DictReader`.
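
As a rough illustration of the hint above, the sketch below reads a few made-up rows with `csv.DictReader` and counts them with `collections.Counter`. The sample rows and the reduced set of columns are assumptions for the example only; the real `winter.csv` has more columns and many more rows.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a CSV file on disk.
sample = io.StringIO(
    "Year,Athlete,Country,Medal\n"
    "1924,ATHLETE A,FRA,Gold\n"
    "1924,ATHLETE B,USA,Silver\n"
    "1928,ATHLETE C,USA,Gold\n"
)

# DictReader uses the header row as keys, so each row is a dict.
reader = csv.DictReader(sample)
medals_per_country = Counter(row["Country"] for row in reader)

# most_common(1) returns a list with the single most frequent (key, count) pair.
top_country, top_count = medals_per_country.most_common(1)[0]
print(top_country, top_count)  # USA 2
```

With a real file, the `io.StringIO` object would be replaced by the file handle returned by `open(...)` inside a `with` block.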

### Question 1.1 

Who won the most winter olympic games medals (gold/silver/bronze) ever? (Hint: there's just one answer)

In [5]:
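import csv
from collections import Counter

def question11():
    # Every row of winter.csv is one medal, so counting rows per
    # country gives the total number of medals per country
    with open('winter.csv', 'r', newline='') as f:
        c = Counter(row['Country'] for row in csv.DictReader(f))
    # most_common(1) returns [(country, count)] for the top country
    return c.most_common(1)[0]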

question11()

('USA', 653)

### Question 1.2 

From `min_year` to `max_year`, which country won the most gold medals?


In [12]:
import csv
from collections import Counter

def question12(min_year, max_year):
    # Count gold medals per country for Games held in [min_year, max_year].
    # The file encoding is assumed to be UTF-8: relying on the platform default
    # raised "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81".
    golds = Counter()
    with open('winter.csv', newline='', encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            if row["Medal"] == "Gold" and min_year <= int(row["Year"]) <= max_year:
                golds[row["Country"]] += 1
    # (country, number of gold medals) for the top country
    return golds.most_common(1)[0]

### Question 1.3

Find the three women with the most 5000 meters medals (gold/silver/bronze).

In [13]:
import csv
from collections import Counter

def question13():
    # Count medals per athlete, keeping only rows that are BOTH a women's
    # event and a 5000M event. The "Athlete" column name is taken from the
    # dataset header; the encoding is assumed to be UTF-8 (the platform
    # default raised UnicodeDecodeError: 'charmap' codec can't decode byte 0x81).
    medals = Counter()
    with open('winter.csv', newline='', encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            if row["Gender"] == "Women" and row["Event"] == "5000M":
                medals[row["Athlete"]] += 1
    # The three athletes with the most medals, as (athlete, count) pairs
    return medals.most_common(3)

## Question 2 (35 points) Spike Triggered Average

Welcome to the world of neuroscience! <br>
In this exercise, you will have the opportunity to be gently introduced to the magnificent world of how your brain executes what you think! 

<img src="brain.jpg" alt="brain" width="50%">

The brain is perhaps our most complex vital organ: even after many years of research, we still do not fully understand how it behaves.
 <br>

Biologically, the brain is made of a particular type of cell called <strong>neurons</strong>. Neurons are electrically excitable cells that communicate with one another via specialized connections named <strong>synapses</strong>. These connections allow the transmission of chemical and electrical signals between neurons. In very high-level terms, this process gives rise to human thought and lets us carry out ordinary tasks and habits, such as talking with friends, eating, drinking, and studying.
Moreover, we can say that two neurons communicate when we observe a <strong>spike</strong>, which is triggered by an electrical <strong>stimulus</strong> (over time). <br> 
As an example, imagine you are on your sofa watching your favorite movie: each frame of the video that reaches your eyes is converted into a sequence of electrical signals, which we defined above as the <strong>stimulus</strong> (this is the scientific term). The stimulus then flows through your network of neurons, potentially activating them and affecting your behavior. Specifically, if you watch a love scene, some neurons may <strong>spike</strong> (that is, they are activated), potentially making you feel pleased and emotional. Conversely, if you watch a violent scene, other neurons may spike, potentially making you feel sad and uncomfortable. This is an abstract example about emotions, but the same idea extends to more practical activities, like the ones mentioned in the paragraph above.

Practically speaking, in this exercise you will analyze a stimulus and how it affects the spikes of a single neuron. <strong>The data is randomly generated</strong>, but it simulates the setting described above. In the end, you will compute the <strong>Spike-Triggered Average</strong>, which, given a fixed time window, approximates the stimulus's behavior before a spike occurs. This is a time-wise average: given many <strong>fixed-length time sequences</strong> (some milliseconds long in this case) of the same stimulus, one per spike, we average the sequences' values at each millisecond step. 
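
To make the time-wise average concrete, here is a minimal sketch on made-up data. The arrays `toy_stimulus` and `toy_spikes` and the 3 ms window are assumptions for illustration, and the window is assumed to end just before the spike sample; other conventions for where the window ends are possible.

```python
import numpy as np

# Toy data (made up for illustration): 10 ms of stimulus and two spikes.
toy_stimulus = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
toy_spikes = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
window = 3  # ms of stimulus kept before each spike

spike_times = np.flatnonzero(toy_spikes)          # indices 3 and 8
spike_times = spike_times[spike_times >= window]  # skip spikes without a full window

# One row per spike: the `window` samples that precede it.
segments = np.array([toy_stimulus[t - window:t] for t in spike_times])

# Average across spikes at each time step of the window.
toy_sta = segments.mean(axis=0)
print(toy_sta)
```

The two segments are `[1, 2, 3]` and `[6, 7, 8]`, so the average at each step is `[3.5, 4.5, 5.5]`.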

Some clarifications: 

You will be provided with two time series, one with the <strong>stimulus</strong> and the other with the <strong>spikes</strong>. The two series are aligned: each element of the spikes corresponds to the element of the stimulus at the same position.
The stimulus varies in time, measured in milliseconds (ms), meaning that <strong>each element in the stimulus series is an electrical signal at a single millisecond</strong>. The spikes are binary: 1 if a spike occurred at that specific millisecond, 0 if it did not.

The dataset used in this exercise is <strong>data.pickle</strong>. It can be retrieved using the module pickle (https://docs.python.org/3/library/pickle.html), which is used for serializing and de-serializing a Python object structure. 
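
For readers new to `pickle`, the sketch below shows a round trip in memory. The dictionary `example` is a hypothetical stand-in that only mimics the two keys used later in this notebook; it is not the assignment data.

```python
import pickle
import numpy as np

# Hypothetical stand-in for the assignment data, with the same two keys.
example = {"stim": np.array([0.5, -1.5, 2.0]), "rho": np.array([0, 1, 0])}

blob = pickle.dumps(example)   # serialize the object to bytes
restored = pickle.loads(blob)  # de-serialize the bytes back to a dict

print(type(restored["stim"]), restored["rho"].sum())
```

`pickle.dump` and `pickle.load` do the same with a file object opened in binary mode. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.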

The following code loads the data. Read the comments and do not change the marked lines.

In [51]:
# Import Dependencies 

import numpy as np # DON'T CHANGE THIS LINE 
import pickle # DON'T CHANGE THIS LINE 

In [52]:
# Load Data 

path = "data.pickle" # Make sure the dataset "data.pickle" is within the same folder of this notebook
data = pickle.load(open(path, 'rb')) # DON'T CHANGE THIS LINE


In [53]:
# Reference Data 

# DON'T CHANGE THESE 2 LINES 
stimulus = data['stim'] # Stimulus (in STA units) over time (in milliseconds units) - Artificial Data - type: numpy.ndarray
rho = data['rho'] # Spikes - 0 or 1. 0 no spike, 1 yes spike - Mapping stimulus - type: numpy.ndarray

### Question 2.1

How many milliseconds does the `stimulus` provided above have? 

In [63]:
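# Each element is one millisecond, so the duration in ms is the number of elements.
# (The sum of absolute values, 24785559.580078125, measures magnitude, not duration.)
stimulusabsolute = np.absolute(stimulus)  # absolute values, reused in Question 2.2
len(stimulus)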

### Question 2.2

Filter out the low stimulus values. 
Set a minimum threshold of 10 STA units for the <strong> absolute value </strong> of the stimulus and filter out everything below it (do not change the original stimulus array).

For example: 
<br>Consider stimulus = [-5.2345, 3.4564, 13.1245, -15.2356]<br>
The final result should be: filtered_stimulus = [13.1245, -15.2356]
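
The example above can be reproduced with a boolean mask, as in this minimal sketch. The names `example_stimulus`, `threshold`, and `keep` are chosen here for illustration.

```python
import numpy as np

example_stimulus = np.array([-5.2345, 3.4564, 13.1245, -15.2356])
threshold = 10

# True where the absolute value reaches the threshold.
keep = np.abs(example_stimulus) >= threshold

# Boolean indexing returns a new array; the original is left untouched
# and the kept values retain their sign.
example_filtered = example_stimulus[keep]
print(example_filtered)
```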

Tip: Use 
<strong> print(filtered_stimulus[0:100]) </strong>
to check if the first 100 values are being filtered correctly.

In [66]:
# Boolean mask on the absolute value. Indexing with a mask returns a new
# array, so `stimulus` is unchanged and the kept values retain their sign.
filtered_stimulus = stimulus[stimulusabsolute >= 10]

print(filtered_stimulus[0:100])

### Question 2.3

Compute the interquartile range of the values of the `stimulus` time series.

  ℹ️<strong>interquartile_range = q75 - q25</strong>
    <br> Where: q25 is the first quartile and q75 is the third quartile.
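
As a small worked example of the formula, assuming NumPy's default linear interpolation between data points, the quartiles of the integers 1 to 9 are 3 and 7:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# First and third quartiles in a single call.
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
print(q25, q75, iqr)  # 3.0 7.0 4.0
```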

In [68]:
interquartile_range = np.quantile(stimulus,0.75) - np.quantile(stimulus,0.25)
interquartile_range

72.861328125

### Question 2.4

Find the positions of the three largest values of the `stimulus` and replace these values with the average of the stimulus (do not change the original stimulus array).

In [76]:
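# Work on a copy so the original stimulus array is left untouched
stimulus_replaced = stimulus.copy()
average = np.average(stimulus)

# Indices of the three largest values (argpartition does not sort them)
maxpos = np.argpartition(stimulus, -3)[-3:]
stimulus_replaced[maxpos] = average

maxpos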

### Question 2.5

Compute the <strong>Spike Triggered Average</strong> as described previously with a time window of 300 ms. I.e. each sequence to be considered for the Spike Triggered Average should have a length of 300 ms. 

Here we provide a visual toy example of the Spike Triggered Average complementing what is described in the main passage:

<img src="sta_example.png" alt="img not available" width="50%">

Each sequence is averaged (millisecond-wise) with a time window of 30 ms before a spike. Bear in mind that, in this question, you are asked to use a 300 ms time window.

In [None]:
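window = 300  # ms of stimulus considered before each spike

# Indices (ms) at which a spike occurred, skipping spikes that happen
# too early to have a full window of stimulus before them
spike_times = np.flatnonzero(rho)
spike_times = spike_times[spike_times >= window]

# Sum the 300 samples preceding each spike, then divide by the spike count
window_sum = np.zeros(window)
for t in spike_times:
    window_sum += stimulus[t - window:t]
spike_triggered_average = window_sum / len(spike_times)
spike_triggered_average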

##### 🎯Check your answer - NOT GRADED

In [None]:
import matplotlib.pyplot as plt # DON'T CHANGE THIS LINE 
import matplotlib.image as mpimg # DON'T CHANGE THIS LINE 


sta = "None" # Please delete "None" and insert here your spike triggered average answer 
plt.plot(range(sta.shape[0]), sta) # DON'T CHANGE THIS LINE 

<strong>If you did everything correctly, your plot should look the same as the one below!</strong>

<img src="sta_sample.png" alt="no-picture" align="left"> <br><br><br><br><br><br><br><br><br><br><br><br><br>


<strong>Congratulations! You have just learned how the brain works! Kudos to you! :)</strong>

## Question 3 (35 points) Car Prices

This exercise consists of data preparation. You can use the libraries you have learned during this course.

### Question 3.1

Download the `ML_Cars_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset.csv) and place it in a folder on your PC. Load it into this notebook as a pandas dataframe named `df`, and display its first 10 rows.

In [18]:
import pandas as pd 

df = pd.read_csv("ML_Cars_dataset.csv")
df.head(10)

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,64.1,2548,dohc,four,2.68,5000,expensive
1,std,front,64.1,2548,dohc,four,2.68,5000,expensive
2,std,front,65.5,2823,ohcv,six,3.47,5000,expensive
3,std,front,,2337,ohc,four,3.4,5500,expensive
4,std,front,66.4,2824,ohc,five,3.4,5500,expensive
5,std,front,66.3,2507,ohc,five,3.4,5500,expensive
6,std,front,71.4,2844,ohc,five,3.4,5500,expensive
7,std,front,,2954,ohc,five,3.4,5500,expensive
8,turbo,front,71.4,3086,ohc,five,3.4,5500,expensive
9,turbo,front,67.9,3053,ohc,five,3.4,5500,expensive


ℹ️ The description of the dataset is available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset_description.txt). Make sure to refer to it throughout the exercise.

### Question 3.2 Duplicates

Remove the duplicates from the dataset if there are any. Overwrite the dataframe `df`.

In [19]:
df = df.drop_duplicates(inplace=False)
df

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,64.1,2548,dohc,four,2.68,5000,expensive
2,std,front,65.5,2823,ohcv,six,3.47,5000,expensive
3,std,front,,2337,ohc,four,3.40,5500,expensive
4,std,front,66.4,2824,ohc,five,3.40,5500,expensive
5,std,front,66.3,2507,ohc,five,3.40,5500,expensive
...,...,...,...,...,...,...,...,...,...
200,std,front,68.9,2952,ohc,four,3.15,5400,expensive
201,turbo,front,68.8,3049,ohc,four,3.15,5300,expensive
202,std,front,68.9,3012,ohcv,six,2.87,5500,expensive
203,turbo,front,68.9,3217,ohc,six,3.40,4800,expensive


### Question 3.3 Missing values

Locate missing values, investigate them, and apply the solutions below accordingly (regarding all features with missing values):

- Impute with most frequent (if the feature is categorical)
- Impute with median (if the feature is numerical)

Make changes effective in the dataset `df`.
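
The two imputation strategies listed above can be sketched on a toy dataframe as follows. The dataframe `toy` and its columns `width` and `fuel` are invented for the example and are not part of the cars dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical feature, each with a gap.
toy = pd.DataFrame({
    "width": [60.0, np.nan, 64.0, 70.0],
    "fuel": ["gas", "gas", None, "diesel"],
})

# Numerical feature: fill with the median of the observed values (64.0).
toy["width"] = toy["width"].fillna(toy["width"].median())

# Categorical feature: fill with the most frequent value ("gas").
toy["fuel"] = toy["fuel"].fillna(toy["fuel"].mode()[0])

print(toy)
```

Filling column by column keeps each replacement value tied to its own feature, which avoids writing a categorical value into a numerical column by accident.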

In [35]:
import numpy as np

# carwidth marks missing values both as NaN and as "*"; unify them and
# convert the column to numbers so that the median can be computed
df["carwidth"] = pd.to_numeric(df["carwidth"].replace("*", np.nan))

# Numerical features: impute with the median of each column
for col in ["carwidth", "curbweight", "stroke", "peakrpm"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical features: impute with the most frequent value of each column
for col in ["aspiration", "enginelocation", "enginetype", "cylindernumber", "price"]:
    df[col] = df[col].fillna(df[col].mode()[0])

df.isna().sum()

aspiration        0
enginelocation    0
carwidth          0
curbweight        0
enginetype        0
cylindernumber    0
stroke            0
peakrpm           0
price             0
dtype: int64

ℹ️ <code>carwidth</code> has multiple representations of missing values: some are <code>np.nan</code>, some are <code>*</code>. Once located, they can be imputed with the median value, since fewer than 30% of the values are missing.

### Question 3.4 Scaling

For the `peakrpm`, `carwidth`, and `stroke` numerical features, apply a scaling technique:
- standardization

Replace the original columns by the transformed values.
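
As a quick illustration of standardization, the sketch below rescales a made-up column so that it has mean 0 and standard deviation 1. The dataframe `toy` and the column `rpm` are assumptions for the example. Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `ddof=0` gives the population standard deviation; either convention is commonly used, as long as it is applied consistently.

```python
import pandas as pd

toy = pd.DataFrame({"rpm": [4000.0, 5000.0, 6000.0]})

# z-score: subtract the mean, divide by the standard deviation.
# ddof=0 selects the population standard deviation.
toy["rpm_scaled"] = (toy["rpm"] - toy["rpm"].mean()) / toy["rpm"].std(ddof=0)

print(toy)
```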

In [44]:
df["peakrpm"] = (df["peakrpm"]-df["peakrpm"].mean())/df["peakrpm"].std()
df["stroke"] = (df["stroke"]-df["stroke"].mean())/df["stroke"].std()
df["carwidth"] = df["carwidth"].astype(float, errors = 'raise')
df["carwidth"] = (df["carwidth"]-df["carwidth"].mean())/df["carwidth"].std()

df

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,-0.887886,2548,dohc,four,-1.822689,-0.239657,expensive
2,std,front,-0.220036,2823,ohcv,six,0.661621,-0.239657,expensive
3,std,front,-0.220036,2337,ohc,four,0.441492,0.819937,expensive
4,std,front,0.209296,2824,ohc,five,0.441492,0.819937,expensive
5,std,front,0.161593,2507,ohc,five,0.441492,0.819937,expensive
...,...,...,...,...,...,...,...,...,...
200,std,front,1.401886,2952,ohc,four,-0.344682,0.608018,expensive
201,turbo,front,1.354182,3049,ohc,four,-0.344682,0.396099,expensive
202,std,front,1.401886,3012,ohcv,six,-1.225197,0.819937,expensive
203,turbo,front,1.401886,3217,ohc,six,0.441492,-0.663494,expensive


### Question 3.5 Encoding

Investigate all the features that require encoding (all features that are categorical), and apply a numerical encoding.

In the dataframe, replace the original features by their encoded version(s).

ℹ️ Note that these features can be encoded in different ways. E.g. `aspiration` and `enginelocation` are binary categorical features, whereas `enginetype` is a multi-category feature. You may choose different ways to encode them.
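
Two common options are sketched below on a toy dataframe: a 0/1 mapping for a binary feature, and one-hot encoding with `pd.get_dummies` for a multi-category feature. The toy values are invented, and one-hot encoding is only one possible choice for `enginetype`; integer codes are another.

```python
import pandas as pd

toy = pd.DataFrame({
    "aspiration": ["std", "turbo", "std"],
    "enginetype": ["ohc", "dohc", "rotor"],
})

# Binary feature: explicit 0/1 mapping (values missing from the dict become NaN).
toy["aspiration"] = toy["aspiration"].map({"std": 0, "turbo": 1})

# Multi-category feature: one indicator column per category.
toy = pd.get_dummies(toy, columns=["enginetype"], dtype=int)

print(toy.columns.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category.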

In [48]:
# Binary features: explicit 0/1 mapping (values missing from a dict become NaN)
df["price"] = df["price"].map({"expensive": 1, "cheap": 0})
df["aspiration"] = df["aspiration"].map({"turbo": 1, "std": 0})
df["enginelocation"] = df["enginelocation"].map({"front": 1, "rear": 0})

# Multi-category features: integer codes in order of first appearance, and cylinder counts
df["enginetype"] = pd.factorize(df["enginetype"])[0]
df["cylindernumber"] = df["cylindernumber"].map({"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12})

### Question 3.6 Feature selection

Considering the collinearity in the dataset and the high correlation between some features (after the encoding process the complexity probably increased), remove the unnecessary features `carwidth` and `cylindernumber`: they are highly correlated with other features and add little value to the dataset.

Make changes effective in the dataframe `df`.
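
One way to inspect collinearity is a correlation matrix, sketched here on invented data. The columns `a`, `b`, and `c` are hypothetical, and which feature of a correlated pair to drop remains a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # exactly 2 * a, so perfectly correlated with a
    "c": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with a
})

# Absolute pairwise Pearson correlations between the numeric columns.
corr = toy.corr().abs()
print(corr.round(2))

# Drop one feature of the highly correlated pair and keep the result.
reduced = toy.drop(columns=["b"])
```

`DataFrame.drop` returns a new dataframe, so the result has to be assigned for the change to take effect.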

In [49]:
# drop() returns a new dataframe, so assign it back to make the change effective
df = df.drop(columns=['carwidth', 'cylindernumber'])
df

Unnamed: 0,aspiration,enginelocation,curbweight,enginetype,stroke,peakrpm,price
0,0,1,2548,0,-1.822689,-0.239657,1
2,0,1,2823,1,0.661621,-0.239657,1
3,0,1,2337,2,0.441492,0.819937,1
4,0,1,2824,2,0.441492,0.819937,1
5,0,1,2507,2,0.441492,0.819937,1
...,...,...,...,...,...,...,...
200,0,1,2952,2,-0.344682,0.608018,1
201,1,1,3049,2,-0.344682,0.396099,1
202,0,1,3012,1,-1.225197,0.819937,1
203,1,1,3217,2,0.441492,-0.663494,1


### 🏁 Save your prepared dataset in "cleaned_dataset.csv"

In [50]:
df.to_csv("cleaned_dataset.csv")