# Introduction

## What is pandas?

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.
Wes McKinney began developing it in 2008.
It is also one of the most popular libraries among data practitioners around the world.

## What can you do with pandas?

Pandas is used for data wrangling, data analysis and data visualisation.

Some examples include creating and merging dataframes, dropping unwanted columns and rows, locating and filling null values, grouping data by category, and creating basic plots such as bar plots, scatter plots and histograms.
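
As a rough illustration of two of these tasks, the sketch below locates and fills a null value and then groups the data by category. The `sales` dataframe and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, None, 30, 20],
})

# Locate the null values: one True/False flag per row
missing = sales["units"].isna()

# Fill the nulls with 0, then total the units for each region
filled = sales.fillna({"units": 0})
totals = filled.groupby("region")["units"].sum()

print(missing)
print(totals)
```

Filling with 0 is only one possible choice; depending on the data, a mean or a dropped row may be more appropriate.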

## Why should you learn to use pandas?

As humans interact more and more with technology, vast amounts of data are being generated each day. Hence, the ability to analyse these data and draw insights from them is becoming an increasingly important skill to have in the modern workforce. Organisations are progressively turning to data to help them better understand their customers and products, analyse past trends and patterns, improve operational efficiency and so on.

Here are just some of the many reasons why you should learn pandas:
- By learning pandas, you learn the fundamental ideas behind working with data and pick up some practical Python coding skills along the way
- It is straightforward to learn and you can immediately apply it to any dataset you want
- It is commonly used in the data science and machine learning community

## Where can you find pandas?

A convenient way to get access to pandas is to install [Anaconda](https://docs.anaconda.com/anaconda/install/), which is a distribution of the Python and R programming languages, both of which are heavily used in data science.

By installing Anaconda, you will also have access to Jupyter Notebook, which is what I am using to write up this documentation. Jupyter Notebook allows you to easily run your Python code cell by cell.

In [1]:
import pandas as pd

In [2]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [3]:
print(pd.__version__)

2.1.4


In [4]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


In [5]:
myvar[0]

1

In [6]:
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


In [7]:
myvar["y"]

7

In [8]:
myvar['z']

2

In [9]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


In [10]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64


In [11]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45


In [12]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


In [16]:
#refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


In [17]:
#use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


In [18]:
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


In [19]:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
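
The cells above select rows with `loc`, which looks rows up by their index label. As a small sketch, the example below compares it with `iloc`, which selects by position instead. `iloc` is not used elsewhere in this notebook and is included only for contrast.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = df.loc["day2"]    # the row whose index label is "day2"
by_position = df.iloc[1]     # the second row, counting from 0

# Both selections refer to the same row here
print(by_label.equals(by_position))
```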


# Week 1: Reading csv files & creating your own dataframe

To use pandas, we first have to import the pandas library. The way to do that is as follows:

In [25]:
# Import pandas under the alias 'pd'

import pandas as pd

## Reading csv files

For this part of the tutorial, you will need to download the [titanic](https://www.kaggle.com/c/titanic/data) dataset from Kaggle. Once you have downloaded the file, unzip it, i.e. extract its contents. Keep in mind where the file is on your computer, because we need to specify its location in Jupyter Notebook in order to load the data.

In [20]:
# Read data via 'pd.read_csv'
# Use the appropriate read function for each file format; for example, pd.read_excel imports Excel files

train = pd.read_csv('titanic_train.csv')
test = pd.read_csv('titanic_test.csv')
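
If the Titanic files are not to hand, `pd.read_csv` can also be tried on text held in memory. The sketch below assumes a tiny made-up CSV (not the real dataset) and wraps it in `io.StringIO` so that it behaves like a file.

```python
import io

import pandas as pd

# A tiny made-up CSV held in memory; a real file path is passed the same way
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,38
3,1,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
# An empty field is read in as a missing value (NaN)
print(sample["Age"].isna().sum())
```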

Let's have a look at our datasets.

In [21]:
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [22]:
# 'head' shows the first five rows of the dataframe by default, but you can specify the number of rows in the parentheses

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [23]:
train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [24]:
df = pd.read_csv("C:/Users/SkillCircle/Downloads/fake.csv")

In [25]:
df

Unnamed: 0,id,first_name,last_name,email,gender,Mob_No,age,Salary
0,1,Emalia,Tingle,etingle0@elpais.com,Female,9995980487,32,37177
1,2,Augusta,Moohan,amoohan1@freewebs.com,Female,5820228081,70,99823
2,3,Marni,Pidler,mpidler2@istockphoto.com,Female,8865123230,50,61846
3,4,Dewie,Doding,ddoding3@biglobe.ne.jp,Male,5525537303,25,75439
4,5,Carmina,Coonan,ccoonan4@live.com,Female,844830267,27,95270
...,...,...,...,...,...,...,...,...
995,996,Daron,Trigg,dtriggrn@theatlantic.com,Male,5937439953,48,18511
996,997,Ailene,Grumble,agrumblero@eventbrite.com,Female,4602749909,79,64264
997,998,Carolyn,Girardetti,cgirardettirp@so-net.ne.jp,Female,8116357905,22,50785
998,999,Henka,Budgett,hbudgettrq@dagondesign.com,Female,934956022,20,80692


In [26]:
# 'tail' shows the bottom five rows by default

train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [28]:
# 'shape' is an attribute (no parentheses) that tells us how many rows and columns exist in a dataframe

train.shape

(891, 12)
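
To see `head`, `tail` and `shape` without the Titanic file, here is a minimal sketch on a hypothetical 20-row dataframe called `numbers`.

```python
import pandas as pd

# Hypothetical 20-row dataframe, used only to keep the example self-contained
numbers = pd.DataFrame({"n": range(20)})

n_rows, n_cols = numbers.shape   # shape is a (rows, columns) tuple
first_three = numbers.head(3)    # first 3 rows instead of the default 5
last_two = numbers.tail(2)       # last 2 rows instead of the default 5

print(n_rows, n_cols)
print(first_three)
print(last_two)
```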

## Creating your own dataframe

In [9]:
# Number entries

test_scores = pd.DataFrame({'Student_ID': [154, 973, 645], 'Science': [50, 75, 31], 'Geography': [88, 100, 66],
                            'Math': [72, 86, 94]})
test_scores

Unnamed: 0,Student_ID,Science,Geography,Math
0,154,50,88,72
1,973,75,100,86
2,645,31,66,94


In [3]:
# Text entries

survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 'Emily': ['It is too sweet', 'Yum!']})
survey

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!


## Index

We can either set an existing column as our index or specify an index when creating a dataframe.

Let's begin by setting an existing column as the index.

In [10]:
test_scores = test_scores.set_index('Student_ID')
test_scores

Student_ID,Science,Geography,Math
154,50,88,72
973,75,100,86
645,31,66,94


Alternatively, we can specify an index column when creating a dataframe via the 'index' argument.

In [5]:
survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'],
                       'Emily': ['It is too sweet', 'Yum!']
                      }, index = ['Product A', 'Product B'])
survey

Unnamed: 0,James,Emily
Product A,I liked it,It is too sweet
Product B,It could use a bit more salt,Yum!


You can also reset the index back to its default.

In [36]:
# Reset index
# Try playing around with 'drop' and 'inplace' and see what they do

survey.reset_index(drop = True,inplace=True)

survey

Unnamed: 0,James,Emily
0,I liked it,It is too sweet
1,It could use a bit more salt,Yum!
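
For readers who want to see what `drop` and `inplace` do, the sketch below rebuilds the `survey` dataframe and calls `reset_index` both ways. It avoids `inplace=True`, so each call returns a new dataframe and the original stays unchanged.

```python
import pandas as pd

survey = pd.DataFrame(
    {"James": ["I liked it", "It could use a bit more salt"],
     "Emily": ["It is too sweet", "Yum!"]},
    index=["Product A", "Product B"],
)

# drop=False (the default) keeps the old index as a new column named 'index'
kept = survey.reset_index()

# drop=True discards the old index entirely
dropped = survey.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))

# Without inplace=True, the original dataframe is left as it was
print(list(survey.index))
```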


## Renaming columns 

In [6]:
# Suppose we want to change the names of the first two columns

test_scores.rename(columns = {'Geography': 'Physics', 'Science': 'Arts'}, inplace = True)
test_scores

Student_ID,Arts,Physics,Math
154,50,88,72
973,75,100,86
645,31,66,94


## Dropping columns and rows

There are a few ways to drop columns or rows from your dataframe. In this example, I am only focusing on the 'drop' method.

In [7]:
# Drop the 'Math' column

test_scores.drop(columns ='Math',inplace=True)
test_scores

Student_ID,Arts,Physics
154,50,88
973,75,100
645,31,66


In [8]:
# Drop row with student_ID 973
# We can make this more robust once we learn the 'loc' function in the coming weeks 
 
test_scores.drop(973,inplace=True)
test_scores

Student_ID,Arts,Physics
154,50,88
645,31,66
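
The cells above use `inplace=True`, which modifies the dataframe directly. As an alternative sketch, `drop` can return a new dataframe instead. The `errors="ignore"` option shown here is an extra that the notebook does not otherwise use.

```python
import pandas as pd

test_scores = pd.DataFrame(
    {"Science": [50, 75, 31], "Geography": [88, 100, 66], "Math": [72, 86, 94]},
    index=pd.Index([154, 973, 645], name="Student_ID"),
)

# drop returns a new dataframe; the original is untouched without inplace=True
without_math = test_scores.drop(columns="Math")
without_973 = test_scores.drop(index=973)

# errors="ignore" skips labels that do not exist instead of raising a KeyError
unchanged = test_scores.drop(index=999, errors="ignore")

print(without_math)
print(without_973)
```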


## Adding columns and rows

In [11]:
test_scores

Student_ID,Science,Geography,Math
154,50,88,72
973,75,100,86
645,31,66,94


In [12]:
# Create a new column for history subject

test_scores['History'] = [79, 70, 88]
test_scores

Student_ID,Science,Geography,Math,History
154,50,88,72,79
973,75,100,86,70
645,31,66,94,88
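
A new column does not have to be typed in by hand; it can also be computed from existing columns. The sketch below adds a hypothetical `Total` column (not part of the original notebook) to the same test scores.

```python
import pandas as pd

test_scores = pd.DataFrame(
    {"Student_ID": [154, 973, 645], "Science": [50, 75, 31],
     "Geography": [88, 100, 66], "Math": [72, 86, 94]}
).set_index("Student_ID")

# Column arithmetic is applied row by row, aligned on the index
test_scores["Total"] = (
    test_scores["Science"] + test_scores["Geography"] + test_scores["Math"]
)

print(test_scores)
```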


In [13]:
# Add more product reviews from James and Emily
# Recall our survey dataframe

survey

Unnamed: 0,James,Emily
Product A,I liked it,It is too sweet
Product B,It could use a bit more salt,Yum!


In [14]:
# Existing DataFrame
survey = pd.DataFrame({'James': ['I liked it', 'It could use a bit more salt'], 
                       'Emily': ['It is too sweet', 'Yum!']
                      }, index=['Product A', 'Product B'])

# New rows to add for James and Emily
new_row = {'James': ['Not good', 'Meh']}
new_row_1 = {'Emily': ['My grandma can cook better', 'Pretty average']}

# Convert the new rows to DataFrames
new_df_james = pd.DataFrame(new_row, index=['Product C', 'Product D'])
new_df_emily = pd.DataFrame(new_row_1, index=['Product E', 'Product F'])

# Concatenate the new rows with the original DataFrame
survey = pd.concat([survey, new_df_james], axis=0)
survey = pd.concat([survey, new_df_emily], axis=0)

# Print updated DataFrame
print(survey)

                                  James                       Emily
Product A                    I liked it             It is too sweet
Product B  It could use a bit more salt                        Yum!
Product C                      Not good                         NaN
Product D                           Meh                         NaN
Product E                           NaN  My grandma can cook better
Product F                           NaN              Pretty average


In [15]:
survey['James'] = ['Not good', 'Meh']

ValueError: Length of values (2) does not match length of index (6)

In [16]:
survey['Emily'] = ['My grandma can cook better', 'Pretty average']

ValueError: Length of values (2) does not match length of index (6)

In [17]:
survey

Unnamed: 0,James,Emily
Product A,I liked it,It is too sweet
Product B,It could use a bit more salt,Yum!
Product C,Not good,
Product D,Meh,
Product E,,My grandma can cook better
Product F,,Pretty average
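
The `ValueError` above arises because assigning a list to a column needs exactly one value per existing row. As a sketch of one alternative, a single new row can be added by assigning to a new index label with `loc`. The one-row `survey` below is a cut-down copy made for this example.

```python
import pandas as pd

survey = pd.DataFrame(
    {"James": ["I liked it"], "Emily": ["It is too sweet"]},
    index=["Product A"],
)

# Assigning a list to a column requires one value per existing row
try:
    survey["James"] = ["Not good", "Meh"]   # 2 values for 1 row
    raised = False
except ValueError:
    raised = True

# Assigning to a label that does not exist yet appends a new row
survey.loc["Product B"] = ["It could use a bit more salt", "Yum!"]

print(raised)
print(survey)
```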


In [18]:
# Create two more rows

df = pd.DataFrame({'James': ['Not good', 'Meh'], 'Emily': ['My grandma can cook better', 'Pretty average']})
df

Unnamed: 0,James,Emily
0,Not good,My grandma can cook better
1,Meh,Pretty average
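
When the new rows carry a value for every column, as in the dataframe just created, `pd.concat` can append them without producing any NaN. This is a minimal sketch that assumes the default integer index and uses `ignore_index=True` to renumber the rows. The names `more` and `combined` are chosen for this example.

```python
import pandas as pd

survey = pd.DataFrame({
    "James": ["I liked it", "It could use a bit more salt"],
    "Emily": ["It is too sweet", "Yum!"],
})

more = pd.DataFrame({
    "James": ["Not good", "Meh"],
    "Emily": ["My grandma can cook better", "Pretty average"],
})

# Stack the two dataframes vertically and renumber the index from 0
combined = pd.concat([survey, more], ignore_index=True)

print(combined)
```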


## Series

There are two core objects in pandas. One is the dataframe, which we have already gone through; the other is the series.

A dataframe, as we have seen, looks like a data table. A series, on the other hand, is a one-dimensional sequence of data values with an index, much like a labelled list.

In [19]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

You can think of a series as a single column of a dataframe, so we can assign index labels to a series just as we would with a dataframe.

In [20]:
profit = pd.Series([75, 80, 66], index = ['2018 Profit', '2019 Profit', '2020 Profit'])
profit

2018 Profit    75
2019 Profit    80
2020 Profit    66
dtype: int64
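
To check the idea that a series is a single column of a dataframe, the sketch below selects one column from the test scores and inspects it. The variable name `math` is chosen for this example.

```python
import pandas as pd

test_scores = pd.DataFrame(
    {"Student_ID": [154, 973, 645], "Science": [50, 75, 31],
     "Geography": [88, 100, 66], "Math": [72, 86, 94]}
).set_index("Student_ID")

# Selecting one column returns a Series that shares the dataframe's index
math = test_scores["Math"]

print(type(math).__name__)
print(math.name)
print(math.index.equals(test_scores.index))
```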

Using this same logic, we can form a dataframe from a list of lists, where each inner list supplies one row. Let's see how we can do that.

In [21]:
customer_sales = pd.DataFrame([[317, 'Melbourne', '80'], [887, 'New York', '91'], [225, 'London', '50']], columns = ['Customer_ID', 'City', 'Sales'])
customer_sales

Unnamed: 0,Customer_ID,City,Sales
0,317,Melbourne,80
1,887,New York,91
2,225,London,50


Unlike before, when we created our dataframe column by column, here each inner list corresponds to a single row in the dataframe.

In [22]:
d = [[317, 'Melbourne', '80'], [887, 'New York', '91'], [225, 'London', '50']]

In [23]:
d[0]

[317, 'Melbourne', '80']

In [24]:
d[2]

[225, 'London', '50']
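
To compare the two construction styles side by side, the sketch below builds the same dataframe once by row (a list of lists) and once by column (a dictionary). It also converts `Sales`, which was entered as text, to integers. That conversion is an extra step not taken in the notebook.

```python
import pandas as pd

# Row by row: each inner list is one row
rows = [[317, "Melbourne", "80"], [887, "New York", "91"], [225, "London", "50"]]
by_row = pd.DataFrame(rows, columns=["Customer_ID", "City", "Sales"])

# Column by column: each dictionary key is one column
by_column = pd.DataFrame({
    "Customer_ID": [317, 887, 225],
    "City": ["Melbourne", "New York", "London"],
    "Sales": ["80", "91", "50"],
})

same = by_row.equals(by_column)
print(same)

# 'Sales' holds strings here, so convert it before doing arithmetic
by_row["Sales"] = by_row["Sales"].astype(int)
print(by_row["Sales"].sum())
```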