<a href="https://colab.research.google.com/github/Gtherron/2nd_remote-repo/blob/main/Quick_Intro_to_Python_for_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Welcome!
---
**Before doing anything else**, please click File, then Save a copy in Drive. Work from your saved copy.

---
Follow along to import, learn about, and manipulate our data.
Feel free to adjust and experiment with the given code!

The first thing we need to do is upload our CSV file into Colab. Run the cell below, then:

* Click Choose Files
* Locate and select "students.csv"

In [None]:
# Upload the file from your computer
from google.colab import files
uploaded = files.upload()

Saving students.csv to students (2).csv




---



## Import Libraries - pandas, numpy

Importing libraries we'll need is one of the first things we'll always do. By having our imports all up at the top of our code, it's easy for us to look at what we've imported to make sure we have everything we need. If we run into an error because we haven't imported a certain library, it's easier to look at all of the libraries we've imported in one place rather than scanning all of the code.


*   We’ll use **pandas**, a powerful library for working with data tables (called DataFrames), to read the CSV and take a peek at what we have.
*   We'll also use **numpy** (NumPy), which provides fast numerical operations on arrays and is widely used in machine learning and statistics.


In [None]:
import pandas as pd
import numpy as np



---



## Load the Data
We've uploaded the **students.csv** file to our Colab environment. Now we can load it into Python so we can start exploring and analyzing the data.

In the cell below we use pandas to read the CSV. We are using a variable that we'll call "df" (short for DataFrame) to represent the CSV data. Go ahead and run the cell.

In [None]:
df = pd.read_csv('students.csv') # Notice the name of the file is in parentheses and quotation marks
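
To see what `pd.read_csv` does without uploading anything, here is a minimal sketch that reads CSV text from an in-memory string instead of a file. The two rows are taken from students.csv; `csv_text` and `sample_df` are names made up for this illustration.

```python
import io
import pandas as pd

# A tiny CSV held in a string (two rows taken from students.csv)
csv_text = """StudentID,FirstName,Age,MathScore
1000,Carmen,12,73
1001,Reese,17,97
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)  # (number of rows, number of columns)
print(sample_df)
```

The first line of the text becomes the column names, and each following line becomes one row of the DataFrame.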



---




## Preview the Data

At this point we are able to view our CSV file as a DataFrame. Let's look at the first few rows of the DataFrame so that we can verify our data loaded properly, and so that we can see the structure of our data.

To look at the first 5 rows of our data we will use our variable and the .head() method.

In [None]:
df.head()

|   | StudentID | FirstName | LastName | Age | Grade | Gender | MathScore | ReadingScore | ScienceScore | AttendanceRate | Hobby |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | Carmen | Roberts | 12 | 7th | Female | 73 | 92 | 91 | 97.67 | Singing |
| 1 | 1001 | Reese | King | 17 | 8th | Female | 97 | 97 | 54 | 96.23 | Gaming |
| 2 | 1002 | Cameron | Johnson | 17 | 12th | Male | 90 | 51 | 77 | 86.78 | Cycling |
| 3 | 1003 | Atlas | King | 13 | 8th | Male | 70 | 72 | 67 | 94.52 | Soccer |
| 4 | 1004 | Jamie | Hall | 12 | 7th | Female | 57 | 100 | 90 | 93.27 | Gaming |


After running the cell above, you should see a table with rows 0-4 and the following features:

StudentID, FirstName, LastName, Age, Grade, Gender, MathScore, ReadingScore, ScienceScore, AttendanceRate, Hobby
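
If you want a different number of rows, `.head()` takes an optional count, and `.tail()` does the same from the bottom of the table. Here is a small sketch on a made-up mini table (`demo` is a hypothetical name; the values are the first five StudentID and Age pairs from the preview):

```python
import pandas as pd

# Hypothetical mini table built from the first five rows of the preview
demo = pd.DataFrame({
    "StudentID": [1000, 1001, 1002, 1003, 1004],
    "Age": [12, 17, 17, 13, 12],
})

first_two = demo.head(2)  # first 2 rows instead of the default 5
last_two = demo.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```

On the real data you could try `df.head(10)` or `df.tail()` in the same way.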



---



## Get Dataset Overview
You should know the size, structure, and features before cleaning or analyzing. Run the cell below to get information about the data we're working with.

In [None]:
print("Shape:", df.shape) # Number of rows and columns

print("Columns:", df.columns.tolist()) # Column names

df.info() # Data types and missing values


Shape: (30, 11)
Columns: ['StudentID', 'FirstName', 'LastName', 'Age', 'Grade', 'Gender', 'MathScore', 'ReadingScore', 'ScienceScore', 'AttendanceRate', 'Hobby']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   StudentID       30 non-null     int64  
 1   FirstName       30 non-null     object 
 2   LastName        30 non-null     object 
 3   Age             30 non-null     int64  
 4   Grade           30 non-null     object 
 5   Gender          30 non-null     object 
 6   MathScore       30 non-null     int64  
 7   ReadingScore    30 non-null     int64  
 8   ScienceScore    30 non-null     int64  
 9   AttendanceRate  30 non-null     float64
 10  Hobby           30 non-null     object 
dtypes: float64(1), int64(5), object(5)
memory usage: 2.7+ KB


Notice under Dtype that some of the features are integers (int64), some are numbers with decimals (float64), and some are objects. In pandas, the object data type usually means text (strings), although an object column can also hold a mix of text and numbers.
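
As a quick sketch of how to work with data types in code, the example below lists each column's dtype and then splits the columns into numerical and non-numerical groups. The mini table `demo` and the names `numeric_cols` and `text_cols` are made up for this illustration.

```python
import pandas as pd

# Hypothetical mini table with one text, one integer, and one decimal column
demo = pd.DataFrame({
    "FirstName": ["Carmen", "Reese", "Cameron"],
    "Age": [12, 17, 17],
    "AttendanceRate": [97.67, 96.23, 86.78],
})

print(demo.dtypes)  # one dtype per column

# Split column names by whether they hold numbers
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
text_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Not numerical:", text_cols)
```

The same two `select_dtypes` calls should work on `df` if you want to list which features will need encoding later.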



---



## Summary Statistics
Now let's do some exploratory data analysis by applying some statistics to our DataFrame. The .describe() method gives us summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for every numerical feature.

In [None]:
print(df.describe()) # Use pandas for built-in summary stats

         StudentID        Age   MathScore  ReadingScore  ScienceScore  \
count    30.000000  30.000000   30.000000     30.000000     30.000000   
mean   1014.500000  14.633333   77.466667     76.566667     72.666667   
std       8.803408   1.938420   17.091505     14.435787     14.118546   
min    1000.000000  12.000000   51.000000     50.000000     52.000000   
25%    1007.250000  13.000000   60.750000     66.500000     61.750000   
50%    1014.500000  14.500000   76.500000     76.500000     70.000000   
75%    1021.750000  16.000000   95.250000     87.750000     85.250000   
max    1029.000000  18.000000  100.000000    100.000000    100.000000   

       AttendanceRate  
count       30.000000  
mean        93.333333  
std          3.903264  
min         86.090000  
25%         90.655000  
50%         92.585000  
75%         97.012500  
max         99.540000  


Notice that, by default, the .describe() method only summarizes features with numerical values. Text features such as FirstName and Hobby are left out.
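
Text features can be summarized too. When a DataFrame contains only text columns, `.describe()` reports the count, the number of unique values, the most common value, and how often it appears. Here is a minimal sketch on a made-up mini table (`demo` and `text_summary` are hypothetical names; the values are the first five Gender and Hobby entries from the preview):

```python
import pandas as pd

# Hypothetical mini table with text columns only
demo = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Female"],
    "Hobby": ["Singing", "Gaming", "Cycling", "Soccer", "Gaming"],
})

# With no numerical columns present, describe() summarizes the text columns
text_summary = demo.describe()
print(text_summary)
```

On the full data, `df.describe(include="all")` is one way to ask for numerical and text features in a single table.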



---



## Use NumPy for Specific Features
Let's use NumPy to get specific stats on a specific feature. In this case let's look at the stats of MathScore.

In [None]:
math_scores = df['MathScore'] # Select the MathScore column for NumPy calculations

print("Mean (NumPy):", np.mean(math_scores))
print("Median:", np.median(math_scores))
print("Standard deviation:", np.std(math_scores))
print("Minimum:", np.min(math_scores))
print("Maximum:", np.max(math_scores))


Now you can see the statistics for MathScore. The mean (about 77.47), median (76.5), minimum (51), and maximum (100) should match the MathScore column in the "Summary Statistics" section above.

* Change MathScore to Age and see how the statistics change. With Age, the output should be:

Mean (NumPy): 14.633333333333333
Median: 14.5
Standard deviation: 1.905838981189707
Minimum: 12
Maximum: 18

* Now change Age to Hobby. Does this work? Look at the Dtype for Hobby (You can find the Dtype in the "Get Dataset Overview" section above.)
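
You may notice that the standard deviation NumPy reports for Age (about 1.906) is slightly smaller than the one in the .describe() table (1.938420). The likely reason is that NumPy divides by n by default (`ddof=0`, the population formula), while pandas divides by n - 1 (`ddof=1`, the sample formula). Here is a small sketch of the difference, using the first five ages from the preview (`ages` and the result names are made up for this illustration):

```python
import numpy as np
import pandas as pd

ages = pd.Series([12, 17, 17, 13, 12])  # first five ages from the preview
age_values = ages.to_numpy()            # plain NumPy array of the same values

population_std = np.std(age_values)       # NumPy default: ddof=0, divides by n
sample_std = np.std(age_values, ddof=1)   # ddof=1, divides by n - 1
pandas_std = ages.std()                   # pandas default: ddof=1

print("Population std (NumPy default):", population_std)
print("Sample std (NumPy, ddof=1):", sample_std)
print("pandas .std():", pandas_std)
```

Neither number is wrong. Passing `ddof=1` to `np.std` simply makes NumPy use the same formula as pandas.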






---



## Explore Categorical Data
We can use the .value_counts() method to count how many times each unique value appears in a feature. This works on numbers (integers or floats) as well as words (strings).

In [None]:
print("Genders:\n", df['Gender'].value_counts()) # Count values in Gender
print("\nHobbies:\n", df['Hobby'].value_counts()) # Count values in Hobby

Genders:
 Gender
Female    17
Male      13
Name: count, dtype: int64

Hobbies:
 Hobby
Swimming       4
Soccer         3
Gaming         3
Singing        2
Cycling        2
Gardening      2
Coding         2
Photography    2
Writing        2
Chess          1
Knitting       1
Painting       1
Reading        1
Running        1
Basketball     1
Hiking         1
Cooking        1
Name: count, dtype: int64


The cell above gets the counts on features with strings. Run the code below to get the counts of a feature with integers rather than strings.

In [None]:
print("\nAge:\n", df['Age'].value_counts()) # Count values in Age


Age:
 Age
12    6
17    6
14    5
16    5
13    4
15    3
18    1
Name: count, dtype: int64
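
Counts can also be shown as proportions by passing `normalize=True`. Here is a minimal sketch on the first five Gender values from the preview (`genders`, `counts`, and `shares` are names made up for this illustration):

```python
import pandas as pd

# First five Gender values from the preview
genders = pd.Series(["Female", "Female", "Male", "Male", "Female"], name="Gender")

counts = genders.value_counts()                # how many times each value appears
shares = genders.value_counts(normalize=True)  # the same counts as fractions of the total

print(counts)
print(shares)
```

On the full data, `df['Gender'].value_counts(normalize=True)` would give the share of each gender instead of the raw counts.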




---



## One-Hot Encoding
In machine learning, models need numbers, not text. That means we need to convert categorical features like FirstName, LastName, Gender, and Hobby into a numerical format. One common way to do this is one-hot encoding. There are other methods, but we'll just focus on one-hot encoding for now.

One-hot encoding turns a category column into multiple binary (0 or 1) columns, one for each category.

For example:

We have the "Gender" feature where each row has text.

| Gender |
|--------|
| Female |
| Male   |
| Female |

We can one-hot encode this feature to look like the following:

| Gender_Female | Gender_Male |
|-------|-------|
| 1     | 0     |
| 0     | 1     |
| 1     | 0     |

In [None]:
df_encoded = pd.get_dummies(df, columns=['Gender']) # One-hot encode the Gender column

df_encoded.head() # Preview the new DataFrame

|   | StudentID | FirstName | LastName | Age | Grade | MathScore | ReadingScore | ScienceScore | AttendanceRate | Hobby | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | Carmen | Roberts | 12 | 7th | 73 | 92 | 91 | 97.67 | Singing | True | False |
| 1 | 1001 | Reese | King | 17 | 8th | 97 | 97 | 54 | 96.23 | Gaming | True | False |
| 2 | 1002 | Cameron | Johnson | 17 | 12th | 90 | 51 | 77 | 86.78 | Cycling | False | True |
| 3 | 1003 | Atlas | King | 13 | 8th | 70 | 72 | 67 | 94.52 | Soccer | False | True |
| 4 | 1004 | Jamie | Hall | 12 | 7th | 57 | 100 | 90 | 93.27 | Gaming | True | False |


Notice in the cell above we have created a new variable called df_encoded. This variable uses the original df variable and one-hot encodes the Gender feature.

By default, recent versions of pandas (2.0 and later) return the one-hot encoded values as boolean (True/False) rather than binary (1/0). This comes from pandas itself, not from Google Colab.

Run the code below to change those features from True/False into binary (1/0).

In [None]:
df_encoded[['Gender_Female', 'Gender_Male']] = df_encoded[['Gender_Female', 'Gender_Male']].astype(int)
df_encoded.head() # Preview the new DataFrame

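|   | StudentID | FirstName | LastName | Age | Grade | MathScore | ReadingScore | ScienceScore | AttendanceRate | Hobby | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | Carmen | Roberts | 12 | 7th | 73 | 92 | 91 | 97.67 | Singing | 1 | 0 |
| 1 | 1001 | Reese | King | 17 | 8th | 97 | 97 | 54 | 96.23 | Gaming | 1 | 0 |
| 2 | 1002 | Cameron | Johnson | 17 | 12th | 90 | 51 | 77 | 86.78 | Cycling | 0 | 1 |
| 3 | 1003 | Atlas | King | 13 | 8th | 70 | 72 | 67 | 94.52 | Soccer | 0 | 1 |
| 4 | 1004 | Jamie | Hall | 12 | 7th | 57 | 100 | 90 | 93.27 | Gaming | 1 | 0 |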
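
The two steps above (encode, then convert to integers) can also be combined by passing `dtype=int` to `pd.get_dummies`. Here is a minimal sketch on the three-row Gender example from the start of this section (`demo` and `encoded` are names made up for this illustration):

```python
import pandas as pd

# The three-row Gender example from the start of this section
demo = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# dtype=int asks for 1/0 columns directly instead of True/False
encoded = pd.get_dummies(demo, columns=["Gender"], dtype=int)

print(encoded)
```

Either approach should give the same table. The one-step version just saves a line of code.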




---



## Feature Selection
We have some features that we don't need. FirstName, LastName, and StudentID all identify the student, and since StudentID already does so uniquely, we can get rid of FirstName and LastName. We do this with the .drop() method, where we list the columns we want to drop.

In [None]:
df_encoded = df_encoded.drop(columns=['FirstName', 'LastName'])
df_encoded.head()

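|   | StudentID | Age | Grade | MathScore | ReadingScore | ScienceScore | AttendanceRate | Hobby | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 12 | 7th | 73 | 92 | 91 | 97.67 | Singing | 1 | 0 |
| 1 | 1001 | 17 | 8th | 97 | 97 | 54 | 96.23 | Gaming | 1 | 0 |
| 2 | 1002 | 17 | 12th | 90 | 51 | 77 | 86.78 | Cycling | 0 | 1 |
| 3 | 1003 | 13 | 8th | 70 | 72 | 67 | 94.52 | Soccer | 0 | 1 |
| 4 | 1004 | 12 | 7th | 57 | 100 | 90 | 93.27 | Gaming | 1 | 0 |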




---



## One-Hot Encoding

At this point most of the data in our DataFrame is numerical. The remaining text features are Hobby and Grade (values like "7th"). Using what you've learned about one-hot encoding, one-hot encode the Hobby feature below.

<font color="green"><b>Hint: Copy and paste the one-hot encoding code from above and replace "Gender" with "Hobby"</b></font>



---



## 🎉 **Good job!** 🎉

You've cleaned your data, selected your features, and now you've got a DataFrame that's ready for some Machine Learning magic!
