# `DSML_WS_03` - Introduction to Pandas

Please work on the following tasks **before** the third workshop session.

## 1. Working with student grades in NumPy

Last week, you familiarized yourself with NumPy. Let's test that knowledge with a small case.

Imagine you are the teacher of a class of 15 students. During the year, the class has written 3 tests, each with a maximum of 100 points. You want to summarize the students' performances using NumPy.

1. Simulate the described case by creating a two-dimensional NumPy array with each row representing a student and each column representing a test. Generate random scores for each student and test between 0 and 100, and assign the array to a variable called `student_scores`.
2. Oops! You completely forgot Thomas, who joined the class during the school year after the first test. Thomas' score for the second test was 87, and 93 for the third test.
    - Since Thomas does not have a score for the first test, you want to simply use the average first-test score of all other students. Calculate this and assign it to a variable called `avg_first_test`. (Hint: array slicing and the function `np.mean()` might be helpful here.)
    - Add Thomas using `avg_first_test` as his first test score and his actual second and third test scores to `student_scores`.
3. You want to generate the sum of the scores from all three tests for each student. Do this using a matrix multiplication and save the resulting array to a variable called `student_totals`.
4. Finally, you want to transform the total scores in `student_totals` to a percentage of maximum available points. Assign this array to a variable called `student_pct`.

In [2]:
# your code here
import numpy as np

# 1
# create a 2D array: 15 students (rows) x 3 tests (columns), random scores from 0 to 100
# (the upper bound of np.random.randint is exclusive, hence 101)
# 'Olivia' is a placeholder 15th name so that the names match the 15 rows
students = np.array(['Alice', 'Bob', 'Cathy', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ivy', 'Jack', 'Kelly', 'Lily', 'Mary', 'Nancy', 'Olivia'])
student_scores = np.random.randint(0, 101, size=(15, 3))
#print(student_scores)

# 2
# average first-test score of the existing students: all rows, first column
avg_first_test = student_scores[:, 0].mean()
thomas_scores = np.array([avg_first_test, 87, 93])
# np.append returns a new array, so the result has to be assigned back
students = np.append(students, 'Thomas')
student_scores = np.vstack((student_scores, thomas_scores))
#print(student_scores)

# 3
# sum of all three tests for each student via matrix multiplication:
# (16 x 3) scores @ vector of three ones -> 16 totals
student_totals = student_scores @ np.ones(3)
print(student_totals)

# 4
# total score as a share of the 300 available points (multiply by 100 for percent values)
student_pct = student_totals / 300
print(student_pct)


[154. 120. 122.  97. 151. 246. 184. 115. 137. 174. 115. 106. 149. 174.
  66. 235.]
[0.51333333 0.4        0.40666667 0.32333333 0.50333333 0.82
 0.61333333 0.38333333 0.45666667 0.58       0.38333333 0.35333333
 0.49666667 0.58       0.22       0.78333333]
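The matrix-multiplication step in task 3 can be checked on a small example. The sketch below uses a hypothetical 3 x 3 score array (`demo_scores` and its values are made up for illustration) and compares the result with a plain row sum.

```python
import numpy as np

# Hypothetical scores: three students (rows) on three tests (columns).
demo_scores = np.array([[70, 80, 90],
                        [55, 65, 75],
                        [100, 0, 50]])

# Multiplying by a vector of ones adds up the entries of each row.
demo_totals = demo_scores @ np.ones(3)

# Cross-check against the more direct row sum.
row_sums = demo_scores.sum(axis=1)
print(np.array_equal(demo_totals, row_sums))

# Share of the 300 available points, expressed in percent.
demo_pct = 100 * demo_totals / 300
print(demo_totals)
print(demo_pct)
```

Both approaches give the same totals; the matrix product is simply the form the task asks for.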


## 2. Getting started with Pandas

This week, we will be exploring Pandas, a core package for working with data in Python. You can think of Pandas objects as enhanced versions of NumPy arrays. Let's see why.

As always, we first have to import Pandas to use its functionality within this Jupyter notebook. Pandas is commonly abbreviated as `pd`.

In [25]:
import pandas as pd

The Pandas equivalent to a one-dimensional array is a Series object, which you create just like an array, using `pd.Series` instead of `np.array`. Let's stick with the student grade example from Task 1, but focus on only five students: Helena, Tom, Nina, Sam and Kim, who are 15, 15, 16, 17 and 16 years old, and scored 75, 69, 87, 88 and 54 points on the first test. Create three Pandas Series objects called `names`, `ages` and `scores` to store the respective data about our five students. How do Pandas Series objects differ from NumPy arrays?

In [30]:
# your code here
lst = ['Helena','Tom','Nina','Sam','Kim']
names = pd.Series(lst)
ages = pd.Series([15,15,16,17,16])
scores = pd.Series([75, 69, 87, 88, 54])

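# note: this nests the three Series inside one object-dtype Series (one element per Series),
# which is why the printed output below is truncated
student_info = pd.Series([names,ages,scores])
print(student_info)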

0    0    Helena
1       Tom
2      Nina
3       Sa...
1    0    15
1    15
2    16
3    17
4    16
dtype:...
2    0    75
1    69
2    87
3    88
4    54
dtype:...
dtype: object
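One way to see how a Series differs from a NumPy array is to attach labels to it. The following is a minimal sketch, assuming we use the student names as the index; the variable names `score_array` and `score_series` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# The scores from the task text, once as a plain array and once as a
# Series labelled with the student names.
score_array = np.array([75, 69, 87, 88, 54])
score_series = pd.Series([75, 69, 87, 88, 54],
                         index=['Helena', 'Tom', 'Nina', 'Sam', 'Kim'],
                         name='score')

# An array is addressed by integer position only.
print(score_array[2])

# A Series can also be addressed by its index label.
print(score_series['Nina'])

# The underlying values are still available as a NumPy array.
print(score_series.to_numpy())
```

Without an explicit `index`, a Series falls back to the integer labels 0, 1, 2, ... that appear in the left column of the printed output.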


At the heart of Pandas are dataframes, the equivalent to two-dimensional arrays. Let's combine our three Series objects into one dataframe using `pd.DataFrame({'name_1': series_1, 'name_2': series_2, ...})` and assign it to a variable called `students`. How does the dataframe differ from a two-dimensional array?

In [32]:
# your code here
students = pd.DataFrame({'Name':names,'Age':ages,'Score':scores})
print(students)

     Name  Age  Score
0  Helena   15     75
1     Tom   15     69
2    Nina   16     87
3     Sam   17     88
4     Kim   16     54
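One difference from a two-dimensional array is that every dataframe column keeps its own data type. The sketch below rebuilds the same table under the hypothetical name `demo` and contrasts it with a NumPy array built from mixed values.

```python
import numpy as np
import pandas as pd

# Same data as in the task, built directly from lists.
demo = pd.DataFrame({'Name': ['Helena', 'Tom', 'Nina', 'Sam', 'Kim'],
                     'Age': [15, 15, 16, 17, 16],
                     'Score': [75, 69, 87, 88, 54]})

# Each column has its own dtype: text for Name, integers for Age and Score.
print(demo.dtypes)

# Numeric columns can be used for calculations right away.
mean_score = demo['Score'].mean()
print(mean_score)

# A NumPy array holds a single dtype, so mixing text and numbers
# converts every entry to a string.
as_array = np.array([['Helena', 15, 75], ['Tom', 15, 69]])
print(as_array.dtype)
```

The dataframe also carries row and column labels, which a plain array does not.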


You can select specific information from your dataframe using the `.loc[row_label, column_label]` indexer. Return all rows but only the age column using `.loc`.

In [33]:
# your code here
students.loc[:,'Age']

0    15
1    15
2    16
3    17
4    16
Name: Age, dtype: int64
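A few more label-based selections, as a minimal sketch on a copy of the table (named `demo` here for illustration). It also contrasts `.loc` with the position-based `.iloc`, which is not used in the tasks but is easy to confuse with it.

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Helena', 'Tom', 'Nina', 'Sam', 'Kim'],
                     'Age': [15, 15, 16, 17, 16],
                     'Score': [75, 69, 87, 88, 54]})

# All rows, one column: the result is a Series.
ages = demo.loc[:, 'Age']

# Label slices include both ends, so 1:3 selects the rows labelled 1, 2 and 3.
subset = demo.loc[1:3, ['Name', 'Score']]

# .iloc works on integer positions and excludes the end, so 1:3 gives two rows.
by_position = demo.iloc[1:3, [0, 2]]

print(ages)
print(subset)
print(by_position)
```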

We can also use `.loc` to filter based on certain conditions. For example, to return only Helena's test score, you could write `students.loc[students.Name == 'Helena', 'Score']`. Return all information on students with a score higher than 80.

In [37]:
# your code here
students.loc[students.Score > 80,:]


   Name  Age  Score
2  Nina   16     87
3   Sam   17     88
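The condition inside `.loc` is itself a Series of `True`/`False` values, one per row. The sketch below makes that mask explicit and shows how two conditions might be combined; the names `demo`, `mask`, `high` and `older_high` are illustrative and not part of the task.

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Helena', 'Tom', 'Nina', 'Sam', 'Kim'],
                     'Age': [15, 15, 16, 17, 16],
                     'Score': [75, 69, 87, 88, 54]})

# The comparison produces one boolean per row.
mask = demo['Score'] > 80
print(mask)

# Passing the mask to .loc keeps only the rows where it is True.
high = demo.loc[mask, :]
print(high)

# Conditions are combined with & (and) or | (or); each one needs parentheses.
older_high = demo.loc[(demo['Score'] > 80) & (demo['Age'] >= 17), 'Name']
print(older_high)
```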
