# Introduction to Pandas Workshop



# Installing Jupyter Notebooks
- Download Anaconda Navigator : https://www.anaconda.com/download 
- OR: 
- On Windows, try: 
    * pip install jupyter
    * jupyter notebook
    * If this doesn't work, pip may be missing or out of date. Try this: 
    * python -m pip install --upgrade pip
    
- On Mac, try:
    * python3 -m ensurepip
    * pip3 install --upgrade pip
    * pip3 install jupyterlab
    * jupyter lab


### What is Data Science
- I asked ChatGPT for a super simple explanation of what data science is: "In beginner terms, data science is like a detective's job for the digital world. It involves using computer skills, statistics, and math to solve puzzles and gain insights from large sets of data."
- Basically, data science is the bridge between applied mathematics and computer science. We use it to analyze large datasets and make sense of them, applying programs and statistical concepts to extract information from the data or to make predictions about real-world scenarios.

### Why Pandas?
- Pandas helps us analyze datasets that are too large for a human to go through by hand.
- Pandas is built around the DataFrame, a two-dimensional data structure that organizes data in a tabular format (rows and columns).
- Pandas accepts multiple file formats: 
    - CSV
    - JSON
    - Text
    - Excel
    - etc. 
- It is relatively easy to learn and widely applicable to data science.
- Overall, Pandas is a data analysis and manipulation library in Python that uses dataframes in order to work with tabular data from datasets. 
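
To make "rows and columns" concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The three rows are made up for illustration and only mimic the shape of the grade data used later; `toy` is a hypothetical name.

```python
import pandas as pd

# Made-up rows, shaped like the grade data used later in this workshop.
toy = pd.DataFrame(
    {
        "subject_code": ["CHEM", "MATH", "MATH"],
        "course_number": [107, 151, 152],
        "a": [31, 40, 25],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # each column carries its own data type
```

Each dictionary key becomes a column name and each list becomes that column's values.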


#### First we need to install Pandas:

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


#### Import the library as alias "pd":

In [2]:
import pandas as pd

- Now we'll read the CSV file by giving the full path to where it is saved. This loads the file into a Pandas DataFrame. If the CSV file is saved in the same folder as this Jupyter notebook, you can just use the file name, "sections.csv".

In [3]:
df = pd.read_csv("/Users/neharajganesh/Desktop/archive/sections.csv")
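
If you don't have the file yet, the sketch below shows the same `read_csv` call on a small in-memory stand-in. The CSV text is made up, and `io.StringIO` is only used here so the example runs without a file on disk.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up.
csv_text = """semester,year,subject_code,course_number,a
FALL,2020,CHEM,107,31
FALL,2020,MATH,151,40
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

The first line of the text becomes the column names, and numeric-looking columns such as `course_number` are read as integers.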

In [4]:
df #This is what the dataframe looks like. 

,semester,year,subject_code,course_number,section_number,a,b,c,d,f,...,diverse_3,diverse_4,diverse_5,diverse_0,feedback_1,feedback_2,feedback_3,feedback_4,feedback_5,feedback_6
0,FALL,2020,CEHD,603,600,10,0,0,0,0,...,0,2,3,0,0,0,0,0,3,2
1,FALL,2020,EDAD,601,600,4,3,0,0,0,...,2,2,2,0,0,0,0,2,2,2
2,FALL,2020,EDAD,606,700,10,0,0,0,0,...,2,0,3,0,0,0,0,0,3,4
3,FALL,2020,EDAD,608,700,42,1,0,0,2,...,2,6,11,1,1,0,2,0,11,11
4,FALL,2020,EDAD,608,701,9,0,0,0,0,...,1,1,2,0,0,0,0,0,1,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40718,FALL,2022,VTPB,927,300,63,62,23,0,0,...,4,13,28,5,1,0,3,10,16,23
40719,FALL,2022,VTPB,927,301,2,10,3,0,0,...,0,4,9,0,0,2,2,1,2,8
40720,FALL,2022,VTPB,930,300,96,49,3,0,0,...,4,18,20,3,0,0,0,3,18,26
40721,FALL,2022,VTPB,930,301,5,9,1,0,0,...,1,4,10,0,0,0,0,2,2,11


### Loading and Inspection:

In [5]:
df.shape #shows you the number of rows and columns in your DataFrame

(40723, 51)

In [6]:
selected_columns = df.loc[:, ["subject_code","course_number","a","b","c","d","f","q"]]
print(selected_columns)

      subject_code  course_number   a   b   c  d  f  q
0             CEHD            603  10   0   0  0  0  0
1             EDAD            601   4   3   0  0  0  1
2             EDAD            606  10   0   0  0  0  0
3             EDAD            608  42   1   0  0  2  0
4             EDAD            608   9   0   0  0  0  0
...            ...            ...  ..  ..  .. .. .. ..
40718         VTPB            927  63  62  23  0  0  0
40719         VTPB            927   2  10   3  0  0  0
40720         VTPB            930  96  49   3  0  0  0
40721         VTPB            930   5   9   1  0  0  0
40722         VTPB            948  57   3   0  0  0  0

[40723 rows x 8 columns]


In [7]:
df = df.iloc[:,0:11] #indexing to see all rows and only first 11 columns 
df

,semester,year,subject_code,course_number,section_number,a,b,c,d,f,q
0,FALL,2020,CEHD,603,600,10,0,0,0,0,0
1,FALL,2020,EDAD,601,600,4,3,0,0,0,1
2,FALL,2020,EDAD,606,700,10,0,0,0,0,0
3,FALL,2020,EDAD,608,700,42,1,0,0,2,0
4,FALL,2020,EDAD,608,701,9,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
40718,FALL,2022,VTPB,927,300,63,62,23,0,0,0
40719,FALL,2022,VTPB,927,301,2,10,3,0,0,0
40720,FALL,2022,VTPB,930,300,96,49,3,0,0,0
40721,FALL,2022,VTPB,930,301,5,9,1,0,0,0
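
The two cells above select columns in two different ways: `loc` by name and `iloc` by position. Here is a small sketch of the difference on made-up data (`toy`, `by_label`, and `by_position` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "semester": ["FALL", "FALL", "SPRING"],
        "subject_code": ["CHEM", "MATH", "MATH"],
        "a": [31, 40, 25],
        "b": [17, 22, 30],
    }
)

# loc selects by label: all rows, the two named columns.
by_label = toy.loc[:, ["subject_code", "a"]]

# iloc selects by position: all rows, columns 1 and 2 (the end, 3, is excluded).
by_position = toy.iloc[:, 1:3]

print(by_label.equals(by_position))
```

Selecting by name is usually safer, because it keeps working if the column order of the file changes.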


In [8]:
print(df.columns) # These are the columns left in our dataset after slicing with iloc.

Index(['semester', 'year', 'subject_code', 'course_number', 'section_number',
       'a', 'b', 'c', 'd', 'f', 'q'],
      dtype='object')


In [9]:
df.head(5) 

,semester,year,subject_code,course_number,section_number,a,b,c,d,f,q
0,FALL,2020,CEHD,603,600,10,0,0,0,0,0
1,FALL,2020,EDAD,601,600,4,3,0,0,0,1
2,FALL,2020,EDAD,606,700,10,0,0,0,0,0
3,FALL,2020,EDAD,608,700,42,1,0,0,2,0
4,FALL,2020,EDAD,608,701,9,0,0,0,0,0


In [10]:
df.tail(5)

,semester,year,subject_code,course_number,section_number,a,b,c,d,f,q
40718,FALL,2022,VTPB,927,300,63,62,23,0,0,0
40719,FALL,2022,VTPB,927,301,2,10,3,0,0,0
40720,FALL,2022,VTPB,930,300,96,49,3,0,0,0
40721,FALL,2022,VTPB,930,301,5,9,1,0,0,0
40722,FALL,2022,VTPB,948,302,57,3,0,0,0,0


In [11]:
filtered_df = df[(df['subject_code'] == "CHEM")]
filtered_df

,semester,year,subject_code,course_number,section_number,a,b,c,d,f,q
4408,FALL,2020,CHEM,100,500,89,18,2,2,1,2
4409,FALL,2020,CHEM,107,503,31,17,8,0,1,1
4410,FALL,2020,CHEM,107,504,79,84,37,7,6,12
4411,FALL,2020,CHEM,107,505,81,84,32,8,7,5
4412,FALL,2020,CHEM,107,506,82,91,36,9,3,6
...,...,...,...,...,...,...,...,...,...,...,...
34661,FALL,2022,CHEM,644,600,4,6,0,0,0,0
34662,FALL,2022,CHEM,646,600,13,11,1,0,0,0
34663,FALL,2022,CHEM,648,600,6,6,0,0,0,0
34664,FALL,2022,CHEM,658,600,8,0,0,0,0,0


In [None]:
df.info() #gives you a summary of the DataFrame's structure. 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40723 entries, 0 to 40722
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   semester        40723 non-null  object
 1   year            40723 non-null  int64 
 2   subject_code    40723 non-null  object
 3   course_number   40723 non-null  int64 
 4   section_number  40723 non-null  object
 5   a               40723 non-null  int64 
 6   b               40723 non-null  int64 
 7   c               40723 non-null  int64 
 8   d               40723 non-null  int64 
 9   f               40723 non-null  int64 
 10  q               40723 non-null  int64 
dtypes: int64(8), object(3)
memory usage: 3.4+ MB


In [None]:
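# course_number is stored as int64 (see df.info() above), so comparing it to the string '107' matches no rows.
print(df[(df['subject_code'] == 'CHEM') & (df['course_number'] == '107')])

In [None]:
# Comparing to the integer 107 returns the CHEM 107 sections.
print(df[(df['subject_code'] == "CHEM") & (df['course_number'] == 107)])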
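
The two filters above differ only in whether 107 is written as a string or as an integer. Here is a small sketch of why that matters, on made-up rows (`toy`, `as_string`, and `as_integer` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", "CHEM", "MATH"], "course_number": [107, 117, 151]}
)

# The column holds integers, so the string "107" never matches.
as_string = toy[(toy["subject_code"] == "CHEM") & (toy["course_number"] == "107")]

# The integer 107 matches as expected.
as_integer = toy[(toy["subject_code"] == "CHEM") & (toy["course_number"] == 107)]

print(len(as_string))
print(len(as_integer))
```

When a filter unexpectedly returns nothing, checking the column types with `df.dtypes` or `df.info()` is a good first step.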

In [None]:
ascending_df = df.sort_values(by='a', ascending=True) # sorts rows by number of A grades, smallest first
print(ascending_df)

In [None]:
descending_df = df.sort_values(by='a', ascending=False) # sorts rows by number of A grades, largest first
print(descending_df)
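
`sort_values` can also sort by more than one column, which helps when there are ties. A minimal sketch on made-up data (`toy` and `top` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", "MATH", "MATH", "CHEM"], "a": [31, 40, 25, 40]}
)

# Sort by 'a' (largest first), breaking ties alphabetically by subject_code,
# then keep the first two rows.
top = toy.sort_values(by=["a", "subject_code"], ascending=[False, True]).head(2)
print(top)
```

The original row labels are kept after sorting, so the index on the left will appear out of order.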

#### *Exercise: Which course has the highest number of q-drops altogether in this dataset? (Use the functions above to solve.)

In [None]:
value_counts = df['subject_code'].value_counts()
print(value_counts)
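
Note that `value_counts` counts how many rows each value appears in; it does not add up any other column. On this dataset that means the number of sections per subject. A small sketch on made-up rows (`toy` and `counts` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", "MATH", "MATH"], "q": [12, 1, 2]}
)

# Counts rows per subject; the 'q' values are not used at all.
counts = toy["subject_code"].value_counts()
print(counts)
```

Here MATH comes first because it appears in two rows, even though CHEM has more q-drops.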



### Calculation:

In [None]:
df.describe() # gives you summary statistics for the numeric columns of the dataset

In [None]:
math_151 = df[(df['subject_code'] == 'MATH') & (df['course_number'] == 151)] 
math_151.describe()
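
`describe()` also works on a single column, and individual statistics can be pulled out by name. A minimal sketch on made-up rows (`toy` and `stats` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "subject_code": ["MATH", "MATH", "CHEM"],
        "course_number": [151, 151, 107],
        "a": [40, 20, 31],
    }
)

math_151 = toy[(toy["subject_code"] == "MATH") & (toy["course_number"] == 151)]

# describe() on one column returns a Series indexed by statistic name.
stats = math_151["a"].describe()
print(stats["count"], stats["mean"], stats["max"])
```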

#### *Exercise: Try to find the stats for one of your current TAMU courses. 

In [None]:
df_dropdf = df.drop('year', axis=1) # returns a new DataFrame without the 'year' column; df itself is unchanged
df_dropdf
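
A quick sketch, on made-up data, showing that `drop` returns a new DataFrame and leaves the original alone (`toy` and `without_year` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"year": [2020, 2021], "a": [31, 40]})

# drop(columns=...) is equivalent to drop(..., axis=1).
without_year = toy.drop(columns="year")

print(list(toy.columns))
print(list(without_year.columns))
```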

In [None]:
result = df.groupby('semester')['q'].sum() # total q-drops in each semester
print(result)

In [None]:
result = df.groupby('semester')['q'].mean() # average q-drops per section in each semester
print(result)
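
The sum and the mean can also be computed in one pass with `agg`. A minimal sketch on made-up data (`toy` and `summary` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"semester": ["FALL", "FALL", "SPRING"], "q": [2, 4, 3]}
)

# One row per semester, one column per statistic.
summary = toy.groupby("semester")["q"].agg(["sum", "mean"])
print(summary)
```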

### Manipulation

In [None]:
df.isnull()

In [None]:
filter_df = df[df.isnull().any(axis=1)] # keeps only the rows with at least one missing value
print(filter_df)

# There are no missing values in this dataset.
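
Since our dataset has nothing missing, here is a small sketch of what the same filter returns when there are gaps. The data is made up, and `None` is used to mark the missing entries (`toy` and `rows_with_missing` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", None, "MATH"], "a": [31, 40, None]}
)

# isnull() marks each missing cell; any(axis=1) flags rows with at least one.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```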

In [None]:
df.duplicated() # no duplicated rows either, this is a good dataset!
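
`duplicated()` returns one True/False value per row, which is hard to check by eye on a large dataset. Summing it gives a count instead. A minimal sketch on made-up rows (`toy`, `n_duplicates`, and `deduplicated` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", "MATH", "CHEM"], "course_number": [107, 151, 107]}
)

# True counts as 1, so the sum is the number of repeated rows.
n_duplicates = toy.duplicated().sum()

# drop_duplicates keeps the first copy of each repeated row.
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated)
```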

In [None]:
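import numpy as np

# Replace two subject codes with NaN to simulate missing data.
# (value=None can forward-fill instead of inserting NaN in some older pandas versions.)
dirty_dummy_df = df.replace(to_replace=["CHEM", "EDAD"], value=np.nan)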


In [None]:
print(dirty_dummy_df['subject_code'].isna().sum()) # number of missing values in one specific column

In [None]:
print(dirty_dummy_df.isna().sum()) # number of missing values in each column

In [None]:
print(dirty_dummy_df.isna().sum().sum()) # total number of missing values in the entire DataFrame

In [None]:
clean_rows_dummy_df = dirty_dummy_df.dropna(inplace=False) # drop the rows where at least one element is missing
print(clean_rows_dummy_df)

In [None]:
clean_cols_dummy_df = dirty_dummy_df.dropna(axis='columns', inplace=False) # drop the columns where at least one element is missing
print(clean_cols_dummy_df)

In [None]:
filtered_dirty_df = dirty_dummy_df[dirty_dummy_df.isnull().any(axis=1)] # same filter as above: rows with at least one missing value
print(filtered_dirty_df)

In [None]:
filled_dummy_df = dirty_dummy_df.fillna(value="UNKNOWN", inplace=False) # replaces every missing value with "UNKNOWN"; a dict can be passed to fill each column differently
print(filled_dummy_df)
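
Both `fillna` and `dropna` can be made more specific. The sketch below, on made-up data, fills each column with its own value and drops rows only when one particular column is missing (`toy`, `filled`, and `kept` are hypothetical names).

```python
import pandas as pd

toy = pd.DataFrame(
    {"subject_code": ["CHEM", None, "MATH"], "q": [12, 1, None]}
)

# A dict maps each column name to its own fill value.
filled = toy.fillna({"subject_code": "UNKNOWN", "q": 0})

# subset limits the check to the listed columns.
kept = toy.dropna(subset=["subject_code"])

print(filled)
print(kept)
```

Filling a text column with "UNKNOWN" and a count column with 0 keeps each column's type consistent.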

In [None]:
# Total people in each section of a course (let's add a column).
# Columns 5 through 10 are the grade counts a, b, c, d, f, q.
df['Total People in Section'] = df.iloc[:, 5:11].sum(axis=1)
df.head(5)
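
The same total can be computed by naming the grade columns instead of counting positions, which may be easier to read. A minimal sketch on made-up rows (`toy`, `grade_columns`, and the `total` column are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [10, 4], "b": [0, 3], "c": [0, 0], "d": [0, 0], "f": [0, 0], "q": [0, 1]}
)

grade_columns = ["a", "b", "c", "d", "f", "q"]

# axis=1 adds across the columns, giving one total per row.
toy["total"] = toy[grade_columns].sum(axis=1)
print(toy)
```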

#### *Final Exercise: Group the dataframe by the subject code and find the sum of each letter grade. Then, print the values of the sums for your TAMU course of choice. 

In [None]:
#Hint to get started:
result = df.groupby('subject_code')['q'].sum()
print(result)

## Sources:
- https://www.datacamp.com/tutorial/pandas
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf