# Regression Wrangle [Acquire & Prepare]

<hr style="border:2px solid gray">

<b>What is "wrangle"?</b>
- This is the step where we acquire and prepare our data in the data science pipeline
- Combining these two steps is often referred to as "wrangling"

<b>Why do we care?</b>
- This sets us up for success in exploration and modeling!

<b>How is regression wrangle different than classification wrangle?</b>
- It's <b><u>not</u></b> different! It is the same!


<hr style="border:1px solid black">

# Example: Pipeline Scenario

<b>Scenario</b>: I'm a university teacher, and I want to know when to worry about a student's progress. I want to be able to work with any students who are at high risk of failing the class, so that I can try to prevent that from happening. I have the grades for the three exams and the final grade from last semester's class. I'm hoping I can build a prediction model that uses these exams to predict the final grade to within an average of 5 points per student.

<b>Goal</b>: We are trying to predict a student's final grade based on previous exam scores

<b>Target Variable</b>: final grade

In [2]:
# imports
import pandas as pd
import numpy as np

# visualization imports:
import matplotlib.pyplot as plt
import seaborn as sns

## Acquire
Plan --> **Acquire** --> Prepare --> Explore --> Model --> Deliver

<b>Goals For Acquisition:</b>
- get data
- cache a local copy
- verify it all came in 
- look at it
- understand the data
    - understand what each <u>row</u> represents
    - understand what each <u>column</u> represents

In [3]:
# assign the URL of the csv to the variable `file`
file = "https://gist.githubusercontent.com/ryanorsinger/\
14c8f919920e111f53c6d2c3a3af7e70/raw/07f6e8004fa171638d6d599cfbf0513f6f60b9e8/student_grades.csv"

In [4]:
file

'https://gist.githubusercontent.com/ryanorsinger/14c8f919920e111f53c6d2c3a3af7e70/raw/07f6e8004fa171638d6d599cfbf0513f6f60b9e8/student_grades.csv'

In [16]:
# file is just a string that points to a url
# we can interact with that url directly as long as it is valid,
# let's check it out
df = pd.read_csv(file, index_col=0)
# df.set_index('student_id')

In [6]:
# ways to look at the data
# df.head()
# df.shape  (an attribute, so no parentheses)
# df.describe()
# df.info()

In [11]:
# dimensions
df.shape

(104, 4)

In [12]:
# head
df.head()

student_id,exam1,exam2,exam3,final_grade
1,100.0,90,95,96
2,98.0,93,96,95
3,85.0,83,87,87
4,83.0,80,86,85
5,93.0,90,96,97


In [17]:
# let's look at the information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104 entries, 1 to 104
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   exam1        103 non-null    float64
 1   exam2        104 non-null    int64  
 2   exam3        104 non-null    object 
 3   final_grade  104 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 4.1+ KB


### Observations:
Based on the data:
- Independent observations for analysis: a single row represents a single student
- There is a missing value in exam 1
- Exam 3 is an object instead of a float or int, which is an undesirable data type
- Does exam 1 need to be a float?
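One likely reason exam 1 shows up as a float: in a default NumPy-backed column, a missing value is stored as `NaN`, which is a float, so the whole column is upcast. A minimal sketch with made-up scores (not taken from the student grades data):

```python
import numpy as np
import pandas as pd

# Hypothetical scores, only for illustration
complete = pd.Series([100, 98, 85])
with_gap = pd.Series([100, np.nan, 85])

print(complete.dtype)  # an integer dtype
print(with_gap.dtype)  # float64, because NaN is a float

# Once the missing value is handled, the column can be cast back to int
restored = with_gap.dropna().astype("int64")
print(restored.dtype)
```

So exam 1 probably does not need to be a float. It is one only because of the single missing score.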

In [18]:
df.describe()

,exam1,exam2,final_grade
count,103.0,104.0,104.0
mean,78.621359,77.307692,81.692308
std,14.260955,10.295703,10.918122
min,57.0,65.0,65.0
25%,70.0,70.0,72.0
50%,79.0,75.0,81.0
75%,92.0,89.0,93.0
max,100.0,93.0,97.0


In [21]:
df.select_dtypes('O').describe()

,exam3
count,104
unique,11
top,96
freq,16


- No clear indication yet of why exam 3 is an object

In [25]:
# count the missing values in each column
df.isnull().sum()

exam1          1
exam2          0
exam3          0
final_grade    0
dtype: int64

In [27]:
# find the row where exam1 is missing
df[df.exam1.isnull()]

student_id,exam1,exam2,exam3,final_grade
10,,70,79,70


### Questions: 
- Why was this exam missing?
- How can we take care of it?
- How do the ways we can take care of it impact our analysis?
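To make the last question concrete, here is a minimal sketch comparing two common options, dropping the row versus filling the gap with the column mean. The data frame `grades` and its values are hypothetical, and mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with one missing exam1 score
grades = pd.DataFrame(
    {"exam1": [100.0, np.nan, 85.0, 70.0], "final_grade": [96, 70, 87, 72]}
)

# Option 1: drop the incomplete row (loses one student)
dropped = grades.dropna()

# Option 2: fill the gap with the column mean (keeps every student)
imputed = grades.fillna({"exam1": grades["exam1"].mean()})

# Compare row counts, means, and spread of exam1 under each option
print(len(dropped), len(imputed))
print(dropped["exam1"].mean(), imputed["exam1"].mean())
print(dropped["exam1"].std(), imputed["exam1"].std())
```

Dropping costs a row. Mean imputation keeps the row and the mean, but it shrinks the spread of the column, which can matter for a regression model. With only one missing row out of 104, either choice is likely to have a small effect.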

In [None]:
# investigation:


In [28]:
df = df.dropna()

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 1 to 104
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   exam1        103 non-null    float64
 1   exam2        103 non-null    int64  
 2   exam3        103 non-null    object 
 3   final_grade  103 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 4.0+ KB


In [None]:
# to fix the data type, first find the rows where exam3 is not purely digits
df[~df.exam3.str.isdigit()]

In [30]:
# preview replacing a space with '0' (the result is not assigned back to df)
df.exam3.str.replace(' ','0')

student_id
1      95
2      96
3      87
4      86
5      96
       ..
100    78
101    79
102    70
103    75
104    78
Name: exam3, Length: 103, dtype: object

In [None]:
df.loc[:, 'exam3']
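One possible way to finish the data type fix is sketched below. It assumes the non-digit entries are empty or whitespace-only strings, and it treats them as missing rather than replacing them with `'0'`, which is a different choice from the preview above. The `sample` frame is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exam3 column: one whitespace-only entry
sample = pd.DataFrame(
    {"exam3": ["95", "96", " ", "87"]},
    index=pd.Index([1, 2, 3, 4], name="student_id"),
)

cleaned = (
    sample
    # turn empty or whitespace-only strings into real missing values
    .assign(exam3=sample["exam3"].replace(r"^\s*$", np.nan, regex=True))
    # drop the rows that are now missing
    .dropna(subset=["exam3"])
    # the remaining strings are all digits, so cast to integer
    .astype({"exam3": "int64"})
)

print(cleaned.dtypes)
```

Treating the blank as missing avoids recording a score of 0 for a student who may simply have no recorded exam, but it does drop that student from the data.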