# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series.
Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat.
Sometimes we have to do some more work ourselves to reformat it for the task at hand.

In [1]:
import pandas as pd

In [5]:
students_df = pd.read_csv("StudentPerformanceFactors.csv")
students_df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


## 1) Summary Functions

pandas provides many simple "summary functions" (not an official name)
that restructure the data in some useful way. For example, consider the `describe()` method:

In [6]:
students_df.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0


This method generates a high-level summary of each numeric column of the DataFrame.
It is type-aware, meaning that its output changes based on the data type of the input.
The output above only makes sense for numerical data; for string data, here's what we get:

In [9]:
students_df.Access_to_Resources.describe()

count       6607
unique         3
top       Medium
freq        3319
Name: Access_to_Resources, dtype: object
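
The type-aware behaviour can be reproduced on a small scale. The sketch below uses a hypothetical toy DataFrame (`toy_df`, with made-up values rather than rows from the CSV) and only illustrates which fields `describe()` reports for numeric and for string data:

```python
import pandas as pd

# Hypothetical toy data, not taken from StudentPerformanceFactors.csv.
toy_df = pd.DataFrame({
    "Sleep_Hours": [7, 8, 6, 7],
    "School_Type": ["Public", "Public", "Private", "Public"],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy_df["Sleep_Hours"].describe()

# String column: count, unique, top (most frequent value), freq (its count).
text_summary = toy_df["School_Type"].describe()

print(numeric_summary)
print(text_summary)
```

Individual fields can be read by label, for example `text_summary["top"]`.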

If you want a particular summary statistic for a DataFrame column or a Series,
there is usually a pandas method that provides it.

For example, to see the mean number of hours the students sleep, we can use the `mean()` method:

In [10]:
students_df['Sleep_Hours'].mean()

np.float64(7.029060087785682)

- The students sleep about 7 hours on average.

In [11]:
students_df['Exam_Score'].mean()

np.float64(67.23565914938702)

- The average exam score is about 67.

To see a list of unique values we can use the `unique()` function:

In [12]:
students_df['School_Type'].unique()

array(['Public', 'Private'], dtype=object)

To see a list of unique values and how often they occur in the dataset, we can use the `value_counts()` method:

In [13]:
students_df['Physical_Activity'].value_counts()  # Physical_Activity: average number of hours of physical activity per week

Physical_Activity
3    2545
2    1627
4    1575
1     421
5     361
0      46
6      32
Name: count, dtype: int64

In [17]:
students_df['Extracurricular_Activities'].value_counts()

Extracurricular_Activities
Yes    3938
No     2669
Name: count, dtype: int64
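
As a small illustration, the sketch below runs `unique()` and `value_counts()` on a hypothetical toy Series (made-up values). It also uses the `normalize=True` option of `value_counts()`, which is not used elsewhere in this tutorial and returns proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
activities = pd.Series(["Yes", "No", "Yes", "Yes", "No"],
                       name="Extracurricular_Activities")

distinct_values = activities.unique()             # distinct values, in order of first appearance
counts = activities.value_counts()                # occurrences per value, most frequent first
shares = activities.value_counts(normalize=True)  # the same counts as fractions of the total

print(list(distinct_values))
print(counts)
print(shares)
```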

## 2) Maps

A `map` is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values.
In data science we often need to create new representations from existing data, or to transform data from its current format
into the format we want. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

### a. `map()`
`map()` is the first, and slightly simpler one.

- `Purpose`: Used to map each value in a Series to a corresponding value based on a function, dictionary, or Series.
- `Works With`: Series. (Since pandas 2.1, DataFrames also have an element-wise `DataFrame.map()`; this tutorial uses the Series version.)
- `Simpler Use Case`: Primarily for transforming or mapping individual elements, often using dictionaries or functions.
- `Return Type`: Returns a transformed Series.

For example, suppose that we wanted to remean the exam scores to 0, that is, shift them so that their mean becomes 0. We can do this as follows:

In [18]:
students_examscore_mean = students_df['Exam_Score'].mean()
students_df['Exam_Score'].map(lambda x: x - students_examscore_mean)

0      -0.235659
1      -6.235659
2       6.764341
3       3.764341
4       2.764341
          ...   
6602    0.764341
6603    1.764341
6604    0.764341
6605    0.764341
6606   -3.235659
Name: Exam_Score, Length: 6607, dtype: float64

The function you pass to `map()` should expect a single value from the Series (an `Exam_Score` value, in the above example), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.
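
`map()` also accepts a dictionary instead of a function. The sketch below is a minimal illustration on a hypothetical toy Series; the numeric codes in `level_codes` are an assumed encoding chosen for the example, not something defined by the dataset:

```python
import pandas as pd

# Hypothetical toy data modelled on the Motivation_Level column.
levels = pd.Series(["Low", "High", "Medium", "Low"], name="Motivation_Level")

# Assumed encoding, chosen only for illustration.
level_codes = {"Low": 0, "Medium": 1, "High": 2}

# Each value is looked up in the dictionary and replaced by its code.
encoded = levels.map(level_codes)

# Values that are missing from the dictionary become NaN.
with_unknown = pd.Series(["Low", "Unknown"]).map(level_codes)

print(encoded.tolist())               # [0, 2, 1, 0]
print(with_unknown.isna().tolist())   # [False, True]
```

The dictionary form suits fixed lookups such as category encodings, while a function is more convenient when the new value has to be computed.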

### b. `apply()`
`apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.

- `Purpose`: Used to apply a function along an axis (rows or columns) of a DataFrame or to each element in a Series.
- `Works With`: DataFrame and Series.
- `Axis`: In a DataFrame, the function can receive each row (`axis=1` or `axis='columns'`) or each column (`axis=0` or `axis='index'`, the default).
- `Flexibility`: Can handle more complex operations and works on both Series and DataFrames.
- `Return Type`: Returns a Series or a DataFrame, depending on the function being applied.

In [24]:
def remean_examScore(row):
    # Each call receives one row as a Series; subtract the overall mean from its exam score.
    row.Exam_Score = row.Exam_Score - students_examscore_mean
    return row

students_df.apply(remean_examScore, axis='columns')

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,-0.235659
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,-6.235659
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,6.764341
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,3.764341
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,2.764341
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6602,25,69,High,Medium,No,7,76,Medium,Yes,1,High,Medium,Public,Positive,2,No,High School,Near,Female,0.764341
6603,23,76,High,Medium,No,8,81,Medium,Yes,3,Low,High,Public,Positive,2,No,High School,Near,Female,1.764341
6604,20,90,Medium,Low,Yes,6,65,Low,Yes,3,Low,Medium,Public,Negative,2,No,Postgraduate,Near,Female,0.764341
6605,10,86,High,High,Yes,6,91,High,Yes,2,Low,Medium,Private,Positive,3,No,High School,Far,Female,0.764341


If we had called `students_df.apply()` with `axis='index'`, then instead of passing a function to transform each `row`, we would need to give a function to transform each column.
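
The difference between the two axes can be sketched on a hypothetical toy DataFrame (`scores`, with made-up values). The helper `center` is an illustrative name, not part of pandas:

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
scores = pd.DataFrame({
    "Previous_Scores": [73, 59, 91],
    "Exam_Score": [67, 61, 74],
})

def center(column):
    # With axis='index', the function receives one whole column (a Series) at a time.
    return column - column.mean()

centered = scores.apply(center, axis="index")

# With axis='columns', the function receives one row (a Series) at a time.
row_totals = scores.apply(
    lambda row: row["Previous_Scores"] + row["Exam_Score"], axis="columns"
)

print(centered)
print(row_totals.tolist())   # [140, 120, 165]
```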

In pandas, the `axis` argument names the labels an operation works with:

- `axis=0` (or `axis='index'`) refers to the row labels.
- `axis=1` (or `axis='columns'`) refers to the column labels.

For example, to drop a column you would use `axis=1`, and to drop a row you would use `axis=0`. With `apply()`, `axis=0` passes each column to the function, while `axis=1` passes each row.

Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively.
They don't modify the original data they're called on. If we look at the first rows of `students_df`,
we can see that they still have their original `Exam_Score` values.

In [26]:
students_df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
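
Because `map()` and `apply()` return new objects, a transformed result has to be assigned somewhere if we want to keep it. A minimal sketch on hypothetical toy data, where the column name `Exam_Score_Centered` is an assumption made for the example:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
scores = pd.DataFrame({"Exam_Score": [67, 61, 74]})
mean_score = scores["Exam_Score"].mean()

# map() returns a new Series; scores["Exam_Score"] itself is left untouched.
centered = scores["Exam_Score"].map(lambda x: x - mean_score)

# Storing the result under a new column name keeps the original column as well.
scores["Exam_Score_Centered"] = centered

print(scores)
```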


pandas provides many common mapping operations as built-ins. For example, here's a faster way of remeaning our `Exam_Score` column:

In [28]:
examscore_mean = students_df['Exam_Score'].mean()
students_df.Exam_Score - examscore_mean

0      -0.235659
1      -6.235659
2       6.764341
3       3.764341
4       2.764341
          ...   
6602    0.764341
6603    1.764341
6604    0.764341
6605    0.764341
6606   -3.235659
Name: Exam_Score, Length: 6607, dtype: float64

In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining `Previous_Scores` and `Exam_Score` columns of the dataset would be to do the following:

In [36]:
students_df.Previous_Scores.astype(str) + ' -- ' + students_df.Exam_Score.astype(str) 

0       73 -- 67
1       59 -- 61
2       91 -- 74
3       98 -- 71
4       65 -- 70
          ...   
6602    76 -- 68
6603    81 -- 69
6604    65 -- 68
6605    91 -- 68
6606    94 -- 64
Length: 6607, dtype: object
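
The same element-wise behaviour applies to arithmetic between two Series. The sketch below uses a hypothetical toy DataFrame (made-up values); the names `change` and `label` are illustrative only:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy_df = pd.DataFrame({
    "Previous_Scores": [73, 59, 91],
    "Exam_Score": [67, 61, 74],
})

# Subtraction pairs up the values of the two Series row by row.
change = toy_df["Exam_Score"] - toy_df["Previous_Scores"]

# String concatenation works the same way once the numbers are converted to text.
label = toy_df["Previous_Scores"].astype(str) + " -> " + toy_df["Exam_Score"].astype(str)

print(change.tolist())   # [-6, 2, -17]
print(label.tolist())    # ['73 -> 67', '59 -> 61', '91 -> 74']
```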

These operators are faster than `map()` or `apply()` because they use optimizations built into pandas and operate on the whole Series at once instead of calling a Python function for each value.
All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things,
like applying conditional logic, which cannot be done with addition and subtraction alone.
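
To make the trade-off concrete, the sketch below contrasts a built-in comparison operator with a `map()` call that applies conditional logic. The data is a hypothetical toy Series, and the thresholds in `grade_band` are assumptions chosen purely for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
exam_scores = pd.Series([67, 61, 74, 71, 70], name="Exam_Score")

# A comparison operator is applied to every value at once and yields a boolean Series.
at_least_70 = exam_scores >= 70

def grade_band(score):
    # Assumed thresholds, for illustration only.
    if score >= 72:
        return "high"
    elif score >= 65:
        return "middle"
    return "low"

# map() calls grade_band once per value, so arbitrary branching is possible.
bands = exam_scores.map(grade_band)

print(at_least_70.tolist())   # [False, False, True, True, True]
print(bands.tolist())         # ['middle', 'low', 'high', 'middle', 'middle']
```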

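### `astype()`

`astype()` converts the values of a Series or DataFrame to another data type and returns the converted copy. We used it above, as `astype(str)`, to turn the numeric scores into strings before concatenating them.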
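
A minimal sketch of `astype()` on a hypothetical toy DataFrame (made-up values). It shows a conversion of a single Series and a conversion of selected DataFrame columns through a dictionary of column names and target types:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy_df = pd.DataFrame({"Previous_Scores": [73, 59], "Exam_Score": [67, 61]})

as_float = toy_df["Exam_Score"].astype(float)   # integers -> floats
as_text = toy_df["Exam_Score"].astype(str)      # integers -> strings

# On a DataFrame, a dict converts only the listed columns.
converted = toy_df.astype({"Previous_Scores": "float64"})

print(as_float.dtype)       # float64
print(as_text.tolist())     # ['67', '61']
print(converted.dtypes)
```

Like `map()` and `apply()`, `astype()` returns a new object, so `toy_df` keeps its original integer columns.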

# Questions from Kaggle


In [None]:
# 6. There are only so many words you can use when describing a bottle of wine.
# Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting
# how many times each of these two words appears in the description column in the dataset.
# (For simplicity, let's ignore the capitalized versions of these words.)
# Note: `reviews` is the wine reviews DataFrame from the Kaggle exercise; it must be loaded before this cell will run.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

In [None]:
# I'm an economical wine buyer. Which wine is the "best bargain"? 
#Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

In [None]:
# We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand,
# so we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less
# than 95 is 2 stars. Any other score is 1 star. Also, the Canadian Vintners Association bought a lot of ads on the site,
# so any wines from Canada should automatically get 3 stars, regardless of points.
# Create a series star_ratings with the number of stars corresponding to each review in the dataset.

def rating_star(row):
    # Canadian wines get 3 stars regardless of points.
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:  # reaching this branch already implies points < 95
        return 2
    else:
        return 1

# apply() with axis='columns' already returns a Series, so no extra pd.Series() wrapper is needed.
star_ratings = reviews.apply(rating_star, axis='columns')

In [38]:
# What is the median of the Exam_Score column in the students_df DataFrame?

median_examscore = students_df.Exam_Score.median()
median_examscore

np.float64(67.0)

In [40]:
# What genders are in the dataset? (Your answer should not include any duplicates.)

gender = students_df.Gender.unique()
gender

array(['Male', 'Female'], dtype=object)

In [42]:
#Create variable centered_exam_score containing a version of the Exam_Score column with the mean Exam_Score subtracted.
#(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)

mean_examscore = students_df.Exam_Score.mean()
centered_exam_score = students_df.Exam_Score - mean_examscore
centered_exam_score

0      -0.235659
1      -6.235659
2       6.764341
3       3.764341
4       2.764341
          ...   
6602    0.764341
6603    1.764341
6604    0.764341
6605    0.764341
6606   -3.235659
Name: Exam_Score, Length: 6607, dtype: float64