# Introduction to Python: Part 4


Hello! Welcome to the bonus lesson!

Congratulations on completing the three previous lessons! You went over the basics of Python programming (hopefully you've done some practice afterwards). In this lesson, we will cover a library known as Pandas. Pandas is used for data analysis, which is very important in machine learning, and we will be using it for a lot of our projects.

If you already know how to use Pandas, it is alright to skip this crash course! And if you are struggling with any of these concepts, make sure to either ask someone for help or find explanations online.

And as always, to get started, copy this file into Deepnote and get right to work! :)

## What is Pandas?

Pandas is an open-source Python package which allows you to analyze, visualize, and manipulate data efficiently.


Some things Pandas provides:

*   A fast and efficient DataFrame object for data manipulation with integrated indexing
*   Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
*   Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
*   Flexible reshaping and pivoting of data sets

Let's first get started with lists and DataFrames.




## Turning a list into a DataFrame

First off, import this notebook into Deepnote. On the right side of your screen, there should be an option to import a file.

What is a DataFrame? A DataFrame lets you store and manipulate tabular data (rows and columns) in Python. Let's do some practice with lists.

In [None]:
import pandas as pd  # pandas must be imported before pd can be used

my_list = [1, 2, 3, 4, 5]
list_df = pd.DataFrame(my_list)  # pd.DataFrame() creates a DataFrame from a list
print(list_df)
print("###########")
my_better_list_df = pd.DataFrame(my_list, columns=['Number'])  # columns=[...] gives headers to your list's DataFrame
print(my_better_list_df)

   0
0  1
1  2
2  3
3  4
4  5
###########
   Number
0       1
1       2
2       3
3       4
4       5
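
A list is not the only starting point. As a minimal sketch, a DataFrame can also be built from a dictionary, where each key becomes a column header. The names and scores below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column header,
# and each list holds that column's values.
scores = {"Student": ["Ana", "Ben", "Cy"], "Score": [90, 85, 77]}
scores_df = pd.DataFrame(scores)

print(scores_df)
print(scores_df.shape)  # (number of rows, number of columns)
```

All of the lists in the dictionary need to be the same length, since each one fills a full column.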


## Coding Challenge #1
In this coding challenge, make a DataFrame out of the data below and give the columns the headers 'Name' and 'Age'. Your printed result should match the table shown under the code cell.

In [None]:
data = [['Dhar',32], ['Alice', 26], ['Alec', 45], ['Bob', 1], ['Stewart', 999]] 

##YOUR CODE HERE##


      Name  Age
0     Dhar   32
1    Alice   26
2     Alec   45
3      Bob    1
4  Stewart  999




Now, let's pull up a CSV file and get to work. Go to this link: https://drive.google.com/file/d/1PAgteYSlOd8gWm-3PRYecXVE850qm0qU/view?usp=sharing and download the CSV file titled `addresses`

Pandas is used specifically for tabular-style data, so we will mostly be using it to read CSV data sets. Let's get started!

After downloading the CSV file, add it to the `Files` tab on the top right of Deepnote.


In [None]:
import pandas as pd  # This is how you bring pandas into your code. There is no need to install it because it comes preinstalled!


## IGNORE THIS CODE UNLESS IN COLAB ##
#from google.colab import files
#uploaded = files.upload()



The `read_csv` function is used to read data from a CSV file. By default, it assumes that the fields are comma separated. The first parameter is the name of the CSV file to read. After uploading, the file should be called `addresses.csv`. If it ended up with a different name, either rename it or pass that name to `read_csv` instead.

In [None]:

new_df = pd.read_csv('addresses.csv', names=["First Name", "Last Name", "Address", "Town", "State", "Zip Code"])
# Since the CSV file does not contain a header row, we supply the column names with names=[...] (it works just like columns=[...])
new_df[:5]  # This retrieves the first 5 rows


Unnamed: 0,First Name,Last Name,Address,Town,State,Zip Code
0,John,Doe,120 jefferson st.,Riverside,NJ,8075
1,Jack,McGinnis,220 hobo Av.,Phila,PA,9119
2,"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075
3,Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD,91234
4,,Blankman,,SomeTown,SD,298


After running the code, you should see a nice table with each person's name, address, town, state, and zip code. That's a good start, but we can do even more!
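
If you want to try `read_csv` without uploading a file, here is a small sketch that reads CSV text from a string instead. The three rows are a made-up, shortened stand-in for the addresses data, and `io.StringIO` is only used to make the string behave like a file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file with no header row.
csv_text = "John,Doe,NJ\nJack,McGinnis,PA\nStephen,Tyler,SD\n"

# names=[...] supplies the column headers, just like in the cell above.
people = pd.read_csv(io.StringIO(csv_text), names=["First Name", "Last Name", "State"])

print(people.head(2))  # head(n) returns the first n rows
print(people.shape)    # (rows, columns)
```

`people.head(2)` and `people[:2]` select the same rows; `head` is simply the more readable spelling.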


In [None]:

print(new_df) #This will print the dataset

              First Name Last Name                           Address  \
0                   John       Doe                 120 jefferson st.   
1                   Jack  McGinnis                      220 hobo Av.   
2          John "Da Man"    Repici                 120 Jefferson St.   
3                Stephen     Tyler  7452 Terrace "At the Plaza" road   
4                    NaN  Blankman                               NaN   
5  Joan "the bone", Anne       Jet               9th, at Terrace plc   

          Town State  Zip Code  
0    Riverside    NJ      8075  
1        Phila    PA      9119  
2    Riverside    NJ      8075  
3     SomeTown    SD     91234  
4     SomeTown    SD       298  
5  Desert City    CO       123  


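Let's say we really do not like Jack McGinnis and want to kick him out. We can do that with the `skiprows` parameter of `read_csv`, which skips the given line numbers (counting from 0) while the file is being read.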


In [None]:
better_df = pd.read_csv('addresses.csv', names=["First Name", "Last Name", "Address", "Town", "State", "Zip Code"], skiprows=[1])  # Skip the second line of the file (line numbers start at 0)
# FYI, skiprows=[...] can take multiple line numbers, so you can do skiprows=[0, 1, 2, 3] for example. Try messing around with it.
print(better_df)


              First Name Last Name                           Address  \
0                   John       Doe                 120 jefferson st.   
1          John "Da Man"    Repici                 120 Jefferson St.   
2                Stephen     Tyler  7452 Terrace "At the Plaza" road   
3                    NaN  Blankman                               NaN   
4  Joan "the bone", Anne       Jet               9th, at Terrace plc   

          Town State  Zip Code  
0    Riverside    NJ      8075  
1    Riverside    NJ      8075  
2     SomeTown    SD     91234  
3     SomeTown    SD       298  
4  Desert City    CO       123  
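
To see how the line numbers in `skiprows` are counted, here is a small sketch on made-up CSV text (a shortened, hypothetical version of the addresses data). Line 0 is the first line of the file.

```python
import io
import pandas as pd

# Hypothetical headerless CSV text with four lines, numbered 0 to 3.
csv_text = "John,Doe\nJack,McGinnis\nStephen,Tyler\nJoan,Jet\n"
cols = ["First Name", "Last Name"]

# Skip only line 1 (Jack).
skip_one = pd.read_csv(io.StringIO(csv_text), names=cols, skiprows=[1])

# Skip lines 0 and 2 (John and Stephen).
skip_two = pd.read_csv(io.StringIO(csv_text), names=cols, skiprows=[0, 2])

print(skip_one)
print(skip_two)
```

Notice that the remaining rows are renumbered from 0, because the skipped lines are never read into the DataFrame at all.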


## Removing Columns/Rows

To remove a column, we use the `df.pop(column_name)` method. It deletes the column from the DataFrame and returns the removed column.

In [None]:
better_df.pop("Zip Code")
print(better_df)

              First Name Last Name                           Address  \
0                   John       Doe                 120 jefferson st.   
1          John "Da Man"    Repici                 120 Jefferson St.   
2                Stephen     Tyler  7452 Terrace "At the Plaza" road   
3                    NaN  Blankman                               NaN   
4  Joan "the bone", Anne       Jet               9th, at Terrace plc   

          Town State  
0    Riverside    NJ  
1    Riverside    NJ  
2     SomeTown    SD  
3     SomeTown    SD  
4  Desert City    CO  
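
`pop` changes the DataFrame in place. As a sketch of an alternative, `drop(columns=[...])` returns a new DataFrame and leaves the original alone. The tiny table below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    "Name": ["John", "Joan"],
    "State": ["NJ", "CO"],
    "Zip Code": [8075, 123],
})

# pop removes the column from `people` itself and hands it back.
zips = people.pop("Zip Code")

# drop returns a copy without the column; `people` still has "State".
trimmed = people.drop(columns=["State"])

print(list(people.columns))
print(list(trimmed.columns))
```

Use `pop` when you want to keep working with the removed column, and `drop` when you want to keep the original DataFrame intact.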


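To remove a row, select the rows you want to keep with `df.iloc[...]` and assign the result back to the variable. `iloc` only selects rows by position, so without the assignment the DataFrame stays unchanged.

In [None]:
better_df = better_df.iloc[1:]  # Keep every row from position 1 onward, which removes the first row
print(better_df)

              First Name Last Name                           Address  \
1          John "Da Man"    Repici                 120 Jefferson St.   
2                Stephen     Tyler  7452 Terrace "At the Plaza" road   
3                    NaN  Blankman                               NaN   
4  Joan "the bone", Anne       Jet               9th, at Terrace plc   

          Town State  
1    Riverside    NJ  
2     SomeTown    SD  
3     SomeTown    SD  
4  Desert City    CO  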
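
Rows can also be removed by their index label with `drop(index=...)`. This is a minimal sketch on made-up data; like `drop(columns=...)`, it returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    "Name": ["John", "Jack", "Joan"],
    "State": ["NJ", "PA", "CO"],
})

# Remove the row whose index label is 1 (Jack).
without_jack = people.drop(index=1)

# The remaining labels are 0 and 2; reset_index(drop=True) renumbers them from 0.
renumbered = without_jack.reset_index(drop=True)

print(without_jack)
print(renumbered)
```

Renumbering is optional, but it avoids gaps in the index that can be confusing when you look rows up by label later.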


## Editing Columns

Let's practice editing columns with a different dataset. Here is the dataset: https://drive.google.com/file/d/1qcZXxvgMEqcbzU40kYbm5YRlC_iMssQT/view 
Make sure to download it onto your PC and then add it to Deepnote.

In [None]:
## IGNORE THIS CODE UNLESS IN COLAB ##
#from google.colab import files
#uploaded = files.upload()



This is what the DataFrame looks like when printed:

In [None]:
df = pd.read_csv("AllDetails.csv")
print(df)


   Sno  Regristation Number                     Name     RollNo Status
0    1             11913907              Molly Singh  RK19TSA01      P
1    2             11918698      Vaishnavi choudhary  RK19TSA02      P
2    3             12000722            Animesh Singh  RK19TSA03      P
3    4             12003178              Yash Panwar  RK19TSA04      P
4    5             12005300           Kallem Kruthik  RK19TSA05      P
5    6             11902981                DHAR MANN  RK19TSA06      P
6    7             11902825  Bhimavarapu Pavan Reddy  RK19TSA07      P
7    8             11902721         Mainam Siddhardh  RK19TSA08      P


Let's say we want to edit the name of Molly Singh. To do that, we will use `df.loc[]`. It takes two inputs: the row label first, then the column name. Think of it as locating the cell at a given pair of coordinates.

In [None]:
df.loc[0, "Name"] = "BOB JONES"
print(df)

   Sno  Regristation Number                     Name     RollNo Status
0    1             11913907                BOB JONES  RK19TSA01      P
1    2             11918698      Vaishnavi choudhary  RK19TSA02      P
2    3             12000722            Animesh Singh  RK19TSA03      P
3    4             12003178              Yash Panwar  RK19TSA04      P
4    5             12005300           Kallem Kruthik  RK19TSA05      P
5    6             11902981                DHAR MANN  RK19TSA06      P
6    7             11902825  Bhimavarapu Pavan Reddy  RK19TSA07      P
7    8             11902721         Mainam Siddhardh  RK19TSA08      P
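
`df.loc[]` also accepts a condition in place of the row label, which helps when you know a value but not its row number. Here is a small sketch on a made-up two-row table (the column names mirror the dataset above, but the data is only for illustration).

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    "Name": ["Molly Singh", "Yash Panwar"],
    "Status": ["P", "P"],
})

# Row label 1, column "Status": edits a single cell.
students.loc[1, "Status"] = "A"

# A condition selects every row where it is True.
students.loc[students["Name"] == "Molly Singh", "Name"] = "Bob Jones"

print(students)
```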


Another useful method is `replace()`. It is perfect for updating a value that occurs many times. Select the column, then pass `replace()` a dictionary of key-value pairs: each key is a value currently in the column, and its value is what to replace it with.



In [None]:
# updating the column value/data
df['Status'] = df['Status'].replace({'P': 'A'})

print(df)

   Sno  Regristation Number                     Name     RollNo Status
0    1             11913907                BOB JONES  RK19TSA01      A
1    2             11918698      Vaishnavi choudhary  RK19TSA02      A
2    3             12000722            Animesh Singh  RK19TSA03      A
3    4             12003178              Yash Panwar  RK19TSA04      A
4    5             12005300           Kallem Kruthik  RK19TSA05      A
5    6             11902981                DHAR MANN  RK19TSA06      A
6    7             11902825  Bhimavarapu Pavan Reddy  RK19TSA07      A
7    8             11902721         Mainam Siddhardh  RK19TSA08      A
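
The dictionary passed to `replace()` can hold more than one pair. As a sketch on made-up data, the example below swaps two codes at once; any value that is not a key in the dictionary is left as it is.

```python
import pandas as pd

# Hypothetical status codes; "L" has no entry in the dictionary below.
status = pd.Series(["P", "A", "P", "L"])

# replace returns a new Series; `status` itself is not modified.
readable = status.replace({"P": "Present", "A": "Absent"})

print(readable)
```

Because `replace()` returns a new object, the result has to be assigned back (as in `df['Status'] = ...` above) for the change to stick.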


Finally, we can also save changes and write them into the CSV file with `df.to_csv()`

In [None]:
df.to_csv("AllDetails.csv", index=False)  # By default, to_csv also writes the row index (0, 1, 2, ...) as an extra column.
# We do not need those numbers stored in the file,
# so we pass index=False to leave them out.

print(df)  # The changes are now saved in the CSV file, so they will still be there when you read it again!

   Sno  Regristation Number                     Name     RollNo Status
0    1             11913907                BOB JONES  RK19TSA01      A
1    2             11918698      Vaishnavi choudhary  RK19TSA02      A
2    3             12000722            Animesh Singh  RK19TSA03      A
3    4             12003178              Yash Panwar  RK19TSA04      A
4    5             12005300           Kallem Kruthik  RK19TSA05      A
5    6             11902981                DHAR MANN  RK19TSA06      A
6    7             11902825  Bhimavarapu Pavan Reddy  RK19TSA07      A
7    8             11902721         Mainam Siddhardh  RK19TSA08      A
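
To see what `index=False` changes without touching any file, here is a small sketch on made-up data. When `to_csv()` is called with no file name, it returns the CSV text as a string instead of writing it to disk.

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    "Name": ["Bob Jones", "Yash Panwar"],
    "Status": ["A", "A"],
})

# With no file name, to_csv returns the CSV text as a string.
with_index = students.to_csv()
without_index = students.to_csv(index=False)

print(with_index)     # each line starts with the row index
print(without_index)  # only the actual columns
```

With the index included, reading the file back with `read_csv` produces an extra unnamed column, which is why `index=False` is the usual choice when saving.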


## Coding Challenge #2

Well, that is the (very basic) introduction to the Pandas library!

This coding challenge is very open-ended. Think of it as a "sandbox mode." If you want more practice with Pandas, find a CSV file online and add some touches to it! For example, rename the columns, remove rows or columns, or replace some values.

Or, to go a step further, try learning more about Pandas on your own! You can learn how to make graphs, calculate summary statistics, etc. These topics will not be required to learn machine learning, but are very helpful in data science.



In [None]:
## YOUR CODE HERE ##