# Welcome to the iCenter's **Pandas Crash Course** 


        brought to you by Charlie :)


Pandas is a Python package that you can use to explore data in data tables. 



In this guide, you will complete simple tasks to get you comfortable with pandas.


          Note: Here's the pandas cheatsheet https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

          It has useful pandas commands that you may want to use later on 
          in this guide, as well as in your journey as a data scientist


# Part 1: Importing pandas

The first step to using pandas is loading it up for use.

**In the cell below, import pandas so you can use it.**

      Hint: use the command "import"

Also see: https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html

In [1]:
# Import pandas here
import pandas as pd


# Part 2: Reading your dataset

Now that we've imported pandas, let's read a dataset.

**Read the csv file called "charlie.csv" and store it in the variable "hello"**

    Hint: pandas can read csv files


Also see: https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html



In [3]:
# store the file in hello

hello = pd.read_csv('charlie.csv')

# Let's call on our variable to see if we did it correctly
hello

Unnamed: 0,Name,Favorite Number,Answer
0,Charlie,5,Yes
1,James,12,No
2,Katie,19,Yes
3,Xander,30,No
4,Kevin,41,Yes
5,Giang,58,Yes


There are other ways of reading datasets, but reading a csv file is one of the most common.

By reading the csv, we have created a **DataFrame**. A DataFrame is the table-like data structure in pandas, made up of rows and columns that we can work with.
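As a minimal sketch of the same step without needing a file on disk, the cell below reads csv text from memory. The sample rows and the names `csv_text` and `sample_df` are made up for illustration and are not part of the course files.

```python
import io
import pandas as pd

# Hypothetical csv text standing in for a file such as "charlie.csv"
csv_text = """Name,Favorite Number,Answer
Charlie,5,Yes
James,12,No
Katie,19,Yes
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))  # the result is a pandas DataFrame
print(sample_df.shape)  # (number of rows, number of columns)
```

The first line of the csv text becomes the column names, and each following line becomes one row.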

 

# Part 3: Know our way around a DataFrame

A DataFrame has rows. Each row is labeled by an **index**. By default, the index is an integer assigned to the row, starting at 0.

We can access a row by its index.

Let's find out how many rows are in our "hello" DataFrame by looking at its **.index** attribute.



In [4]:
hello.index

RangeIndex(start=0, stop=6, step=1)

We can enter the index integer into **.iloc[ ]** to access all the information from a specified row.



In [5]:
# For example, if I wanted to see what is in row 3, I could do the following
hello.iloc[3]

Name               Xander
Favorite Number        30
Answer                 No
Name: 3, dtype: object

**Now you try! Look at your hello DataFrame and select the row that contains James**

In [7]:
# Try to get all the information from the row that contains James

James_data = hello.iloc[1]


# Let's print our answer below

print(James_data)


Name               James
Favorite Number       12
Answer                No
Name: 1, dtype: object


As you can see above, the index corresponds with a row.

A DataFrame also has columns. 

Let's see what columns we are working with in our "hello" DataFrame using the **.columns** attribute.

In [8]:
hello.columns

Index(['Name', 'Favorite Number', 'Answer'], dtype='object')

We can enter a row's index and a column name into **.loc[ ]** to access the single value stored there.

Example: DataFrame.loc[ integer, 'Column Name']

**Using the hello DataFrame and the example above, determine the ANSWER that CHARLIE gave**


    Hint: First determine the correct index. Then, determine the column name that contains Charlie's answer

In [11]:
# Find Charlie's Answer
charlie_answer = hello.loc[0, 'Answer']

# Let's print our answer below
print(charlie_answer)

Yes
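The difference between **.iloc[ ]** and **.loc[ ]** is easiest to see when the index no longer starts at 0. The sketch below uses a small made-up DataFrame (`people` and `subset` are hypothetical names): `.iloc` counts rows by position, while `.loc` looks up the index label.

```python
import pandas as pd

# Hypothetical data mirroring the first rows of the "hello" DataFrame
people = pd.DataFrame({
    "Name": ["Charlie", "James", "Katie"],
    "Answer": ["Yes", "No", "Yes"],
})

# Keep the rows at positions 1 and 2; their index labels stay 1 and 2
subset = people.iloc[1:]

by_position = subset.iloc[0]["Name"]  # first row of subset, by position
by_label = subset.loc[1, "Name"]      # row whose index label is 1

print(list(subset.index))
print(by_position, by_label)
```

Both lookups reach the same row here, because the first row of `subset` still carries the label 1 from the original DataFrame.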


# Part 3.5: Useful functions in pandas

Above, we used the .loc and .iloc indexers. Below are a few functions that may come in
handy when you start working with data. 

Further reading: https://www.analyticsvidhya.com/blog/2021/05/pandas-functions-13-most-important/



1. .head(n) gives us the first 'n' rows of a dataframe (5 rows if we leave 'n' out)
2. .describe() gives us a statistical summary of the numeric columns in the dataframe
3. .value_counts() gives the number of times each value appears
4. .drop_duplicates() drops duplicate rows
5. .fillna() replaces NaN (missing) values with the value we enter in

In [12]:
p3_df = pd.read_csv('partthreedf.csv')
p3_df

Unnamed: 0,name,height,weight,shoe size,color
0,Joe,77.0,200,13,red
1,Jane,63.0,100,6,blue
2,Ben,70.0,125,8,red
3,John,69.0,150,9,green
4,Joe,77.0,200,13,red
5,Alex,,175,10,purple
6,Sam,61.0,140,7,yellow


In [13]:
# example code
head_example = p3_df.head(3)

describe_ex = p3_df.describe()

vcounts_ex = p3_df['color'].value_counts()

droppings_ex = p3_df.drop_duplicates()

filled_ex = p3_df.fillna(1000)


Print each example below by removing the hashtag (#) in front of it, and see what the outputs look like.


If we are outputting a dataframe, compare the new dataframe to the old one and see what's different.

In [15]:
# Playground for you to try things 

# head_example
# describe_ex
# vcounts_ex
droppings_ex
# filled_ex

Unnamed: 0,name,height,weight,shoe size,color
0,Joe,77.0,200,13,red
1,Jane,63.0,100,6,blue
2,Ben,70.0,125,8,red
3,John,69.0,150,9,green
5,Alex,,175,10,purple
6,Sam,61.0,140,7,yellow
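Notice that the index of a DataFrame can have gaps after rows are dropped. As a small sketch with made-up data (the names `shoes`, `deduped`, and `renumbered` are hypothetical), the cell below shows that `.drop_duplicates()` returns a new DataFrame and leaves the original alone, and that `.reset_index(drop=True)` can renumber the rows if we want a clean index.

```python
import pandas as pd

# Hypothetical data where the second row repeats the first
shoes = pd.DataFrame({
    "name": ["Joe", "Joe", "Jane"],
    "shoe size": [13, 13, 6],
})

deduped = shoes.drop_duplicates()            # new DataFrame; shoes is unchanged
renumbered = deduped.reset_index(drop=True)  # relabel the rows 0, 1, ...

print(len(shoes), len(deduped))
print(list(deduped.index), list(renumbered.index))
```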


# Part 4: Altering your DataFrame

Sometimes our DataFrame has information that we don't want. There are ways to change our DataFrame to remove unwanted data.

What if we don't care about "Favorite Number"? We can drop the column from the "hello" DataFrame using the following code.

    DataFrame.drop(columns = "Column Name")


In [16]:
# We will remove the Favorite Number column and store the new dataframe as hello2
hello2 = hello.drop(columns = "Favorite Number")

# Let's see if we did it correctly
hello2


Unnamed: 0,Name,Answer
0,Charlie,Yes
1,James,No
2,Katie,Yes
3,Xander,No
4,Kevin,Yes
5,Giang,Yes


We can also add columns to our DataFrame. 

Below is an example of how we can add a new column to our hello DataFrame that contains the squares of our favorite numbers.

In [17]:
# We are squaring the Favorite number column and putting the values in a new column named "New Number"
hello["New Number"] = hello["Favorite Number"]**2

hello

Unnamed: 0,Name,Favorite Number,Answer,New Number
0,Charlie,5,Yes,25
1,James,12,No,144
2,Katie,19,Yes,361
3,Xander,30,No,900
4,Kevin,41,Yes,1681
5,Giang,58,Yes,3364


As you can see, we made a new column called "New Number" in our "hello" DataFrame.

We can do the same thing when given a list of values. Keep in mind, our list needs to be the same length as the DataFrame.

In [18]:
# Below are two new lists
new_list_1 = ["horse", "crab", "goose", "dog", "otter", "fish"]
new_list_2 = [3, 6, 9, 12, 15, 18]

# We can add a list directly to the hello dataframe under a new column name
hello["animals"] = new_list_1

# Let's see how we've changed our dataframe
hello

Unnamed: 0,Name,Favorite Number,Answer,New Number,animals
0,Charlie,5,Yes,25,horse
1,James,12,No,144,crab
2,Katie,19,Yes,361,goose
3,Xander,30,No,900,dog
4,Kevin,41,Yes,1681,otter
5,Giang,58,Yes,3364,fish
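To see why the list length matters, here is a small sketch using a made-up DataFrame called `pets` (a hypothetical name). Assigning a list that is too short raises a `ValueError`, which we catch so the cell keeps running.

```python
import pandas as pd

# Hypothetical three-row DataFrame
pets = pd.DataFrame({"Name": ["Charlie", "James", "Katie"]})

try:
    # Only two values for three rows, so pandas refuses the assignment
    pets["animals"] = ["horse", "crab"]
    length_error = None
except ValueError as err:
    length_error = err

print(length_error)

# A list with one value per row works
pets["animals"] = ["horse", "crab", "goose"]
print(pets)
```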


Now you try! **Here's your challenge:**

**First, remove the two columns "New Number" and "animals" from hello and store the result as my_df**

**Next, add new_list_2 to my_df as a new column**

**Finally, create a new column in my_df that has the product of "Favorite Number" and the numbers from new_list_2**

In [20]:
# Drop your columns below
my_df = hello.drop(columns = ['New Number', 'animals'])

# Look at your dataframe to see if it worked

my_df


Unnamed: 0,Name,Favorite Number,Answer
0,Charlie,5,Yes
1,James,12,No
2,Katie,19,Yes
3,Xander,30,No
4,Kevin,41,Yes
5,Giang,58,Yes


In [22]:
# Add new_list_2 to my_df below
my_df["mo3"] = new_list_2

# Look at the my_df and see if it worked
my_df

Unnamed: 0,Name,Favorite Number,Answer,mo3
0,Charlie,5,Yes,3
1,James,12,No,6
2,Katie,19,Yes,9
3,Xander,30,No,12
4,Kevin,41,Yes,15
5,Giang,58,Yes,18


In [23]:
# Find the product and store it as a new column in my_df
my_df['Product'] = my_df['Favorite Number']*my_df['mo3']

# See if it worked

my_df

Unnamed: 0,Name,Favorite Number,Answer,mo3,Product
0,Charlie,5,Yes,3,15
1,James,12,No,6,72
2,Katie,19,Yes,9,171
3,Xander,30,No,12,360
4,Kevin,41,Yes,15,615
5,Giang,58,Yes,18,1044


# Part 5: Creating Your Own DataFrame

We can also create a brand new DataFrame and store data into columns. 

This can be done using a **dictionary**.

Read more here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html

Below is an example that you can run to see where everything falls.


In [24]:
some_list = ['yes', 'no', 34]

my_dictionary = {'column_name': some_list, 'more numbers': [ 2, 12, 144]}
brand_new_df = pd.DataFrame(my_dictionary)
brand_new_df


Unnamed: 0,column_name,more numbers
0,yes,2
1,no,12
2,34,144


As shown above, we use the pandas.DataFrame() function 
to turn our dictionary into a DataFrame. 

Within the dictionary, we use colons to link a column 
name with data that we want to store.

The example below is more advanced: a simulation.

In [25]:
import random
random.seed(1)


# pancakes is an empty list that we can store values in
pancakes = []

# We will simulate rolling a die 10 times
for i in range(10):
    dice1 = random.randint(1, 6)
    d = {"dice roll": dice1}
    # Append the dictionary for this roll to our list
    pancakes.append(d)

# Each dictionary in the list becomes one row of the DataFrame
dice_simulation = pd.DataFrame(pancakes)

dice_simulation

Unnamed: 0,dice roll
0,2
1,5
2,1
3,3
4,1
5,4
6,4
7,4
8,6
9,4
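As a hedged variation on the simulation above, the sketch below builds the list of dictionaries with a list comprehension and gives each dictionary two keys, so the resulting DataFrame has two columns. The names `rolls`, `rolls_df`, and the "roll number" column are made up for illustration.

```python
import random
import pandas as pd

random.seed(1)  # fixed seed so repeated runs give the same rolls

# Each dictionary becomes one row; each key becomes a column
rolls = [
    {"roll number": i + 1, "dice roll": random.randint(1, 6)}
    for i in range(10)
]
rolls_df = pd.DataFrame(rolls)

print(rolls_df.shape)
# Count how many times each face came up
print(rolls_df["dice roll"].value_counts())
```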


**Here's a challenge for you:**

**Create a DataFrame where**

**- the first column contains the first 3 letters of the alphabet**

**- and the second column contains the first 3 letters of your name**

In [26]:
first3 = ['a', 'b', 'c']
my3 = ['c', 'h', 'a']
# Avoid naming the variable "dict", which would hide Python's built-in dict
letters_dict = {'col1': first3, 'col2': my3}
letters_df = pd.DataFrame(letters_dict)
letters_df

Unnamed: 0,col1,col2
0,a,c
1,b,h
2,c,a


# Part 6: Filtering rows in a DataFrame

We can filter our DataFrame and create new DataFrames based on True/False conditions in pandas.

A condition that evaluates to True or False is known as a Boolean expression.

Let's see how this can be applied to our DataFrame.



For instance, I want to see which rows contain the name "Kevin". In the next cell, we will see which rows fit our condition.

In [27]:
hello["Name"] == "Kevin"

0    False
1    False
2    False
3    False
4     True
5    False
Name: Name, dtype: bool

As you can see, the row with index 4 is the only one that is "True". It's the only row with the name "Kevin".

We can also look at number ranges. For instance, I want to see which rows have a "Favorite Number" greater than or equal to 30.

In [28]:
hello["Favorite Number"] >= 30

0    False
1    False
2    False
3     True
4     True
5     True
Name: Favorite Number, dtype: bool

We can apply these True/False filters to a DataFrame and get a new DataFrame containing only the rows where the condition is True.

In [29]:
hello_kevin = hello[ hello["Name"] == "Kevin"]

hello_kevin

Unnamed: 0,Name,Favorite Number,Answer,New Number,animals
4,Kevin,41,Yes,1681,otter


In [30]:
hello_30 = hello[ hello["Favorite Number"] >= 30]

hello_30

Unnamed: 0,Name,Favorite Number,Answer,New Number,animals
3,Xander,30,No,900,dog
4,Kevin,41,Yes,1681,otter
5,Giang,58,Yes,3364,fish


**Using our hello DataFrame and the examples above, try to create a DataFrame of only rows where the Answer is Yes**

In [31]:
# Apply the filter of Answer == Yes to a Dataframe and store it in hello_yes

hello_yes = hello[ hello['Answer'] == 'Yes']

# Let's check if we got the correct DataFrame
hello_yes

Unnamed: 0,Name,Favorite Number,Answer,New Number,animals
0,Charlie,5,Yes,25,horse
2,Katie,19,Yes,361,goose
4,Kevin,41,Yes,1681,otter
5,Giang,58,Yes,3364,fish
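Filters can also be combined. The sketch below uses a made-up DataFrame (`people` is a hypothetical name) and the `&` (and) and `|` (or) operators. Each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical data in the same shape as the "hello" DataFrame
people = pd.DataFrame({
    "Name": ["Charlie", "Xander", "Kevin", "Giang"],
    "Favorite Number": [5, 30, 41, 58],
    "Answer": ["Yes", "No", "Yes", "Yes"],
})

is_big = people["Favorite Number"] >= 30
said_yes = people["Answer"] == "Yes"

# Rows where both conditions are True
both = people[(people["Favorite Number"] >= 30) & (people["Answer"] == "Yes")]

# Rows where at least one condition is True
either = people[is_big | said_yes]

print(both)
print(either)
```

Storing a condition in a variable, as with `is_big` and `said_yes`, can make longer filters easier to read.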


# Congrats, you've reached the end of the iCenter Pandas Crash Course!

After completing this activity, you now know...

- what pandas is

- how to import pandas and read csv files

- the basic structure of dataframes

- how to add, remove, and alter columns in a dataframe

- how to create your own dataframe from a dictionary

- how to filter rows in a dataframe


