
![alt text](http://www.cs.nott.ac.uk/~pszgss/teaching/nlab.png)
# FBA Computing Session Week 7:

**Python Revision + Sklearn**

The aim of this tutorial is to cover a number of important concepts we have learned so far in view of the upcoming test. You will be writing loops, functions, conditions as well as creating lists and dictionaries.

After that we will focus on the use of libraries to ease our task, starting with Pandas to load and pre-process our data before moving to Sklearn to do our first machine learning task in Python, a linear regression!

As in the previous tutorial, we will be using the **admissions.csv** file (available from the Moodle page). This file describes people who are applying for a postgraduate degree at a US university. It records three relevant features for each applicant, as well as whether they got into the masters or PhD course they were applying for when applicants were selected by hand (note that if we were doing a full analytics project, this would be our target class!):

* The dataset has a binary output feature (i.e. dependent variable) called "admit".
* There are three predictor variables: gre, gpa and ranking.
* Variables gre (an exam result score) and gpa (the person’s grade point average) are continuous.
* The variable ranking takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

Let's get started, first by doing a lot of the grunt work ourselves!

### A little tip for when you develop a new function

In [1]:
# When you write a new function definition you can
# add some function calls just after to check how
# it behaves with some expected values
def my_sum(a, b):
    return a + b

print(my_sum(0, 2))
print(my_sum(-3, 1))
print(my_sum(4, 6))

2
-2
10


## Python Revision

### Step A1 - Write a function to count the number of lines in a file

Your function will take a parameter **filename** containing the path of the file to open and read. It will return the number of lines in the file.

In [2]:
!wget -O week5_data.zip "https://drive.google.com/uc?export=download&id=18C-_-ojwxWq2HztrJaSBkQkrwvpMPszZ"
!unzip week5_data.zip

--2023-11-13 13:51:03--  https://drive.google.com/uc?export=download&id=18C-_-ojwxWq2HztrJaSBkQkrwvpMPszZ
Resolving drive.google.com (drive.google.com)... 74.125.135.100, 74.125.135.138, 74.125.135.101, ...
Connecting to drive.google.com (drive.google.com)|74.125.135.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/3g7os8g4u90avcpbisqou8phpcktugcg/1699883400000/02584936932483403665/*/18C-_-ojwxWq2HztrJaSBkQkrwvpMPszZ?e=download&uuid=648611e8-6985-4581-b9b7-3753c9b6cd76 [following]
--2023-11-13 13:51:04--  https://doc-14-80-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/3g7os8g4u90avcpbisqou8phpcktugcg/1699883400000/02584936932483403665/*/18C-_-ojwxWq2HztrJaSBkQkrwvpMPszZ?e=download&uuid=648611e8-6985-4581-b9b7-3753c9b6cd76
Resolving doc-14-80-docs.googleusercontent.com (doc-14-80-docs.googleusercontent.com)... 74.125.20.132, 2607:f

In [3]:
# be careful with your file path!!
# your code here
def count_lines(filename):
  lines = 0
  with open(filename, 'r') as f:
    for line in f:
      lines += 1
  return lines

count_lines('admissions.csv')

401

### Step A2 - Loading the data manually

Our goal is to write our own data loader for the file **admissions.csv**. To do that, we are going to go step by step. The first step is to write a function that can process one line of our file (e.g. *'0,380,3.61,3'*) and transform it into a dictionary.

Write a function **data_to_dict( )** with the following requirements:
- Parameter: txt (type string)
- Purpose : The function should create a new dictionary containing the keys **admit**, **gpa**, **gre** and **ranking**, and be populated from the string passed as a parameter. You will need to manipulate the string to extract the different values. We also want to transform the admit value to True when it is equal to 1 and False when it is equal to 0 so that we have a boolean type. For the other fields, use the data types that you feel are the most appropriate.
- Returns: The populated dictionary

In [4]:
# your code here for data_to_dict() function
def data_to_dict(txt):
  # Split the comma-separated line into its four fields
  fields = txt.strip().split(',')
  admission_dict = {}
  # The comparison already yields a boolean (True for '1', False for '0')
  admission_dict['admit'] = fields[0] == '1'
  admission_dict['gre'] = int(fields[1])
  admission_dict['gpa'] = float(fields[2])
  admission_dict['ranking'] = int(fields[3])
  return admission_dict

print(data_to_dict('0,380,3.61,3')) # You should have: {'admit': False, 'gre': 380, 'gpa': 3.61, 'ranking': 3}
print(data_to_dict('1,640,3.19,4')) # You should have: {'admit': True, 'gre': 640, 'gpa': 3.19, 'ranking': 4}

{'admit': False, 'gre': 380, 'gpa': 3.61, 'ranking': 3}
{'admit': True, 'gre': 640, 'gpa': 3.19, 'ranking': 4}


### Step A3 - Collating our data

We now have a function that can take a line of our **admissions.csv** file and return us a nicely populated dictionary. If we do that for all the lines, we can create a list containing all these dictionaries and have our data structured. Let's give it a go!

***As always, check what the text file you are trying to load looks like. It is extremely common for the first few lines to have a different structure that you will need to accommodate (or discard). (TIP: you can use array indexes or a counter, for example, to skip specific lines)***

In [5]:
# your code here
list_admission = []
with open('admissions.csv', 'r') as f:
  next(f)
  for line in f:
    list_admission.append(data_to_dict(line))

print(list_admission)
print(len(list_admission))

[{'admit': False, 'gre': 380, 'gpa': 3.61, 'ranking': 3}, {'admit': True, 'gre': 660, 'gpa': 3.67, 'ranking': 3}, {'admit': True, 'gre': 800, 'gpa': 4.0, 'ranking': 1}, {'admit': True, 'gre': 640, 'gpa': 3.19, 'ranking': 4}, {'admit': False, 'gre': 520, 'gpa': 2.93, 'ranking': 4}, {'admit': True, 'gre': 760, 'gpa': 3.0, 'ranking': 2}, {'admit': True, 'gre': 560, 'gpa': 2.98, 'ranking': 1}, {'admit': False, 'gre': 400, 'gpa': 3.08, 'ranking': 2}, {'admit': True, 'gre': 540, 'gpa': 3.39, 'ranking': 3}, {'admit': False, 'gre': 700, 'gpa': 3.92, 'ranking': 2}, {'admit': False, 'gre': 800, 'gpa': 4.0, 'ranking': 4}, {'admit': False, 'gre': 440, 'gpa': 3.22, 'ranking': 1}, {'admit': True, 'gre': 760, 'gpa': 4.0, 'ranking': 1}, {'admit': False, 'gre': 700, 'gpa': 3.08, 'ranking': 2}, {'admit': True, 'gre': 700, 'gpa': 4.0, 'ranking': 1}, {'admit': False, 'gre': 480, 'gpa': 3.44, 'ranking': 3}, {'admit': False, 'gre': 780, 'gpa': 3.87, 'ranking': 4}, {'admit': False, 'gre': 360, 'gpa': 2.56, '
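As a point of comparison, the same loading step can be sketched with the standard library's `csv.DictReader`, which reads the header row and uses it as the dictionary keys. The in-memory sample below is a hypothetical stand-in for **admissions.csv**, and the header names are assumed to be `admit,gre,gpa,ranking`.

```python
import csv
import io

# Hypothetical in-memory sample standing in for admissions.csv
# (the header row is an assumption about the real file)
sample = io.StringIO("admit,gre,gpa,ranking\n0,380,3.61,3\n1,660,3.67,3\n")

records = []
for row in csv.DictReader(sample):  # the header row supplies the keys
    records.append({
        'admit': row['admit'] == '1',   # '1' -> True, anything else -> False
        'gre': int(row['gre']),
        'gpa': float(row['gpa']),
        'ranking': int(row['ranking']),
    })

print(records)
print(len(records))
```

With a real file you would pass the object returned by `open(...)` instead of the `io.StringIO` sample; there is no need to skip the header manually because `DictReader` consumes it.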

### Step A4 - Computing some means

You should now have a list containing 400 dictionaries. As our data is nicely structured, we can fairly easily compute the mean GRE and mean GPA. That's a good practice for the test!

In [17]:
# your code here not using NumPy
def calculate_mean(key_value):
  # Use 'total' rather than 'sum' to avoid shadowing Python's built-in sum()
  total = 0
  for record in list_admission:
    total += record[key_value]

  return total/len(list_admission)

print("GRE mean: ", calculate_mean('gre'))
print("GPA mean: ", round(calculate_mean('gpa'),4))

GRE mean:  587.7
GPA mean:  3.3899


## Using Pandas, Numpy and Sklearn this time!

### Step B1
Using the **pandas** library and its **read_csv()** function, load the data of the file **admissions.csv** into a DataFrame.

In [18]:
# your code here
import pandas as pd
df = pd.read_csv('admissions.csv')

### Step B2

This should become a good habit: each time you load new data, check that it has been loaded correctly using the **head()** method.

In [19]:
# your code here
df.head()

|   | admit | gre | gpa | ranking |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |


### Step B3

Let's start by checking (using **Pandas** methods this time) that the mean GRE and GPA correspond to what we computed manually in the previous section!

In [20]:
# your code here
print(df.gre.mean())
print(df.gpa.mean())

587.7
3.3899


### Step B4 - Updating a column

For clarity we want to replace the values of the column admit with True where it is equal to 1 and False where it is equal to 0. For this, you will need to use the **loc** indexer from **Pandas**.

In [21]:
# your code here
df.loc[df.admit == 1, 'admit'] = True
df.loc[df.admit == 0, 'admit'] = False

In [22]:
# It is always good to check your operation
# had the expected output!
df.head()

|   | admit | gre | gpa | ranking |
|---|---|---|---|---|
| 0 | False | 380 | 3.61 | 3 |
| 1 | True | 660 | 3.67 | 3 |
| 2 | True | 800 | 4.0 | 1 |
| 3 | True | 640 | 3.19 | 4 |
| 4 | False | 520 | 2.93 | 4 |
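For comparison, the same conversion can be sketched in a single step by comparing the column with 1, which avoids having integers and booleans mixed in the column midway through the update. The three-row DataFrame below is hypothetical and only stands in for the admissions data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the admissions DataFrame
demo = pd.DataFrame({'admit': [0, 1, 1], 'gre': [380, 660, 800]})

# Comparing with 1 yields a boolean Series, which replaces the 0/1 column
demo['admit'] = demo['admit'] == 1

print(demo.dtypes)
print(demo)
```

The exercise asks for **loc** because selecting rows with a condition and assigning to them is a pattern you will reuse often; the comparison above is just a shortcut for this particular 0/1 case.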


### Step B5 - An overview of our data

As you probably remember from last week's tutorial, you can easily obtain a basic statistical summary of your data using the **describe()** method.

In [23]:
# your code here
df.describe()

|   | gre | gpa | ranking |
|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 |
| mean | 587.7 | 3.3899 | 2.485 |
| std | 115.516536 | 0.380567 | 0.94446 |
| min | 220.0 | 2.26 | 1.0 |
| 25% | 520.0 | 3.13 | 2.0 |
| 50% | 580.0 | 3.395 | 2.0 |
| 75% | 660.0 | 3.67 | 3.0 |
| max | 800.0 | 4.0 | 4.0 |


### Step B6 - An overview of a slice of your data

If we are interested in the effect of **ranking** on the other variables, it can be useful to obtain a statistical summary of a subset of our data. In our case, we can use the method **describe()** on the data subset for a specific ranking (e.g. what is the statistical summary when the ranking is 1? And 4? Any differences?)

In [25]:
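# your code here
# Note: a notebook cell only displays its last expression,
# so only the summary for ranking 4 is shown below
df[df.ranking == 1].describe()
df[df.ranking == 4].describe()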

|   | gre | gpa | ranking |
|---|---|---|---|
| count | 67.0 | 67.0 | 67.0 |
| mean | 570.149254 | 3.318358 | 4.0 |
| std | 116.221999 | 0.360507 | 0.0 |
| min | 300.0 | 2.26 | 4.0 |
| 25% | 500.0 | 3.085 | 4.0 |
| 50% | 560.0 | 3.33 | 4.0 |
| 75% | 660.0 | 3.54 | 4.0 |
| max | 800.0 | 4.0 | 4.0 |


### Step B7 - Number of student applications per ranking category

Remember that Pandas has a ready-made function for this. It was part of last week's lecture!

In [26]:
# your code here
df['ranking'].value_counts()

2    151
3    121
4     67
1     61
Name: ranking, dtype: int64

### Step B8 - Admission rate (percentage) per ranking category

This time we want to compute the admission rate (in percentage) for each ranking category (i.e. 1, 2, 3 and 4). We just want to print these percentages to the screen.

In [32]:
# your code here
# Admission rate for a rank = admitted applicants in that rank / all applicants in that rank
for rank in [1, 2, 3, 4]:
  rank_df = df[df.ranking == rank]
  admitted = rank_df[rank_df.admit == True]
  print('For rank ' + str(rank) + ': ', round(len(admitted) / len(rank_df) * 100, 2))

For rank 1:  54.1
For rank 2:  35.76
For rank 3:  23.14
For rank 4:  17.91
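An admission rate per ranking category is the number of admitted applicants in a category divided by the number of applicants in that same category. One compact way to sketch this in Pandas is `groupby` followed by `mean`, since the mean of a boolean column is the fraction of True values. The toy DataFrame below is hypothetical and only illustrates the idea.

```python
import pandas as pd

# Hypothetical toy data: 2 of 4 rank-1 applicants admitted, 1 of 4 rank-2 applicants
toy = pd.DataFrame({
    'admit':   [True, True, False, False, True, False, False, False],
    'ranking': [1, 1, 1, 1, 2, 2, 2, 2],
})

# The mean of a boolean column is the fraction of True values in each group
rates = toy.groupby('ranking')['admit'].mean() * 100
print(rates)
```

Here rank 1 comes out at 50% and rank 2 at 25%, each relative to the applicants of its own category rather than to the whole dataset.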


### Step B9 - Pearson Correlation

Ok, let’s now start exploring the data with a model. Pearson Correlation is the key factor in a linear regression - the further it is from zero the stronger the linear relationship (even though we all know now, that linear relationships certainly don’t tell the whole story). Let’s use **NumPy** to find that correlation:

In [33]:
import numpy as np

r = np.corrcoef(df.gpa, df.gre)
print(r)

[[1.         0.38426588]
 [0.38426588 1.        ]]


As you can see, you get a matrix output (you can check the type using **print(type(r))** if you want). The only score that matters is the one in the first row, second column (or second row, first column, which is the same) - that’s the correlation between the variables we care about. Given that this number is part of a matrix (ndarray), try to extract it using **NumPy** indexes.

In [None]:
# your code here

More than numbers, data analysts need to understand what they mean. You have now obtained a number for the Pearson Correlation of our two variables. What do you think of this correlation? Is it strong, medium or low? Use the following rule of thumb:

- Perfect: If the value is near ± 1, then it is said to be a perfect correlation: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).
- High degree: If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a strong correlation.
- Moderate degree: If the value lies between ± 0.30 and ± 0.49, then it is said to be a medium correlation.
- Low degree: When the value lies below ± 0.29, then it is said to be a small correlation.
- No correlation: When the value is zero (or close to zero).

'type your answer here'

### Step B10 - Importing Sklearn

This is still just a descriptive correlation score and won’t actually help us predict anything. To actually try and predict someone’s GRE exam score from their GPA we need a linear model. You can try to do that using **Orange** with a graphical interface. It is almost as easy in Python thanks to the **sklearn** library, so let's give it a go!

The first step is to import the library using the **from** *module* **import** *submodule* syntax. From the library **sklearn** we want to import the submodule **linear_model**, try to write that import statement.

In [34]:
from sklearn import linear_model

### Step B11 - Create our features

It’s a tradition from the olden days of yore, when people still used punchcards to program computers, that when we create predictive models we always call our input features “X” and our output features “Y”. So let’s just quickly do that: we have a DataFrame containing our data from which we want to extract **gpa** (our input feature) and **gre** (our output feature) and store them respectively in variables **X** and **Y**.

In [35]:
X = df[['gpa']]
Y = df[['gre']]

Note that this time we don’t just use df.gpa to extract the field’s data. This is because we want it in the form of a matrix (and not a vector, which is what df.gpa would give us). If you want to see the difference, try the following code:

In [36]:
print(df.gpa.shape)
print(df[['gpa']].shape)

(400,)
(400, 1)


When you run this you will see that df.gpa only has one dimension (it is a vector), whereas our new version has two dimensions (it is a matrix with 400 rows and only one column). The difference is subtle but just trust me for now - this means it will be in the perfect format ready to feed to sklearn as it uses matrices.

### Step B12 - Linear Regression

Create a linear regression model, not by dropping in a widget as you would in Orange... but by typing the following couple of lines:

In [37]:
model = linear_model.LinearRegression()
model.fit(X,Y)
print(model.coef_, model.intercept_)

[[116.63935631]] [192.30424605]


Et voilà. The first line creates an ‘empty’ model ready for us to train. The second line does the training by finding the line that best ‘fits’ the data. The final line prints out the $b_1$ (the gradient, coef_) and the $b_0$ (the intercept, intercept_) in the equation $y = b_1x+b_0$. Job done.
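To see what the fitted equation means in practice, here is a small worked sketch that plugs a GPA into $y = b_1x+b_0$ using the coefficients printed above (copied as displayed, so they are rounded). The function name `predict_gre` and the GPA of 3.5 are hypothetical, chosen only for illustration.

```python
# Coefficients copied (as printed, hence rounded) from the model output above
b1 = 116.63935631   # gradient (model.coef_)
b0 = 192.30424605   # intercept (model.intercept_)

def predict_gre(gpa):
    """Apply y = b1 * x + b0 for a single GPA value."""
    return b1 * gpa + b0

# Hypothetical applicant with a GPA of 3.5
print(round(predict_gre(3.5), 2))
```

For a fitted sklearn model, `model.predict` carries out the same calculation, so this is only meant to make the arithmetic visible.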

***Quick challenge:***

Any thoughts on what the connection between the gradient coefficient (coef_) and the correlation (r) is? DON’T CHEAT ;)

***Answer:***

If you standardize your data, the correlation is the gradient coefficient. The closer that gradient is to 1 (or -1), the stronger the link!

Just to prove the above (and to show you how to standardize your data - which is vital for distance measure based predictors) add the following lines before you create your linear model:

In [38]:
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
Y_scaled = preprocessing.scale(Y.astype(float))

Sklearn gives us a special set of functions to do this sort of thing. By default, scale() will standardize your data (i.e. mean centre the data and divide through by its standard deviation), making all data “unit-less”. This means we can use it more readily in anything that compares distances, and all the datapoints are now comparable on any feature, whatever the scale it originally used.
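Standardization itself is simple arithmetic, sketched here with NumPy on hypothetical values: subtract the mean, then divide by the standard deviation. This assumes the population standard deviation (NumPy's default, `ddof=0`), which is also what sklearn's `scale()` uses.

```python
import numpy as np

# Hypothetical GPA-like values
values = np.array([2.5, 3.0, 3.5, 4.0])

# Subtract the mean, then divide by the (population) standard deviation
standardized = (values - values.mean()) / values.std()

print(standardized)
print(standardized.mean(), standardized.std())
```

After the transformation the values have a mean of (approximately) 0 and a standard deviation of 1, regardless of the original units.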

**Re-run your LinearRegression fitting using X_scaled and Y_scaled this time, and confirm that the correlation you calculated via numpy is the same as the linear regression gradient coefficient!**

In [None]:
# your code here

### Bonus Step - Want to practice your lists a bit more?

Try to compute the median value of GPA ***without*** using the **median()** method of **Pandas or Numpy**.

In [None]:
# your code here
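One possible approach, sketched on a plain Python list: sort the values, then take the middle one (odd length) or the average of the two middle ones (even length). The function name `median` and the sample values (the first few GPAs shown earlier) are used only for illustration; for the exercise you would first collect the GPA values from your own data.

```python
def median(values):
    """Median of a non-empty list, computed by sorting."""
    ordered = sorted(values)          # sorted() leaves the original list untouched
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd length: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even length: mean of the two middle values

print(median([3.61, 3.67, 4.0, 3.19, 2.93]))
print(median([3.61, 3.67, 4.0, 3.19]))
```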