### Pandas

Pandas is a widely used Python library that provides data structures and functions for data manipulation and analysis.

It is commonly used in data analytics, data science, machine learning and related fields.

Pandas has two primary data structures: the **Series** and the **DataFrame**.
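
As a quick illustration of how the two structures relate, here is a minimal sketch. The names and values (`temperatures`, `weather`) are made up for this example only.

```python
import pandas as pd

# A Series is a single labelled column of values
temperatures = pd.Series([30.5, 28.0, 31.2], name="temperature")

# A DataFrame is a table made up of one or more columns
weather = pd.DataFrame({
    "city": ["Accra", "Kumasi", "Tema"],
    "temperature": [30.5, 28.0, 31.2],
})

print(temperatures.ndim)             # number of dimensions of the Series
print(weather.ndim)                  # number of dimensions of the DataFrame
print(type(weather["temperature"]))  # a single column of a DataFrame is a Series
```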

### Installation of Pandas

The pandas library can be installed in either of the following ways:


**1. Command Prompt**

- Launch the command prompt
- Type the command below to install the pandas library

`pip install pandas`



**2. Anaconda Terminal**

- Launch the anaconda terminal
- Type the command below to install the pandas library

`conda install pandas`


### Using Pandas

To use pandas, import the library with:

`import pandas as pd`

**pd** is the conventional alias for the pandas library.

In [1]:
import pandas as pd

### Series

A Series is a one-dimensional labelled array that can hold data of any type, such as integers, floats and strings.

A pandas Series can be created from a list, a numpy array, a dictionary, a scalar, or data loaded from a file or database.

The general form for creating a Series is:

`pd.Series(data)`
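
Besides `data`, the constructor also accepts optional `index`, `dtype` and `name` arguments. A small sketch with made-up values (the `scores` data is hypothetical):

```python
import pandas as pd

scores = pd.Series(
    data=[85, 92, 78],
    index=["math", "science", "art"],  # custom labels instead of 0, 1, 2
    dtype="float64",                   # store the integers as floats
    name="scores",
)
print(scores)
```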

#### A Series from a list

In [2]:
num_1 = [1,2,3,4,5]
num_1_series =pd.Series(num_1, name="numbers")
num_1_series

0    1
1    2
2    3
3    4
4    5
Name: numbers, dtype: int64

In [3]:
type(num_1_series)

pandas.core.series.Series

In [4]:
num_1_series.shape

(5,)

In [5]:
num_1_series.name

'numbers'

In [6]:
num_1_series.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
len(num_1_series)

5

In [8]:
#Renaming the index of num_1_series
num_1_series.index = ["a","b","c","d","e"]

In [9]:
num_1_series

a    1
b    2
c    3
d    4
e    5
Name: numbers, dtype: int64

In [10]:
num_1_series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

#### A Series from a dictionary

In [11]:
# The keys of the dictionary serve as the index of the Series.
# The values of the dictionary serve as the values of the Series.

regions_capitals ={"Greater Accra":"Accra","Ashanti":"Kumasi", "Central":"Cape Coast"}
regions_capitals
regions_capitals_series=pd.Series(regions_capitals, name="regions")
regions_capitals_series

Greater Accra         Accra
Ashanti              Kumasi
Central          Cape Coast
Name: regions, dtype: object

In [12]:
regions_capitals_series.index

Index(['Greater Accra', 'Ashanti', 'Central'], dtype='object')

In [13]:
regions_capitals_series.values

array(['Accra', 'Kumasi', 'Cape Coast'], dtype=object)

In [14]:
ages ={"age":[12,15,22,33,55,22], "number":[1,2,3,4,5,6]}
ages_series = pd.Series(ages, name="ages")
ages_series

age       [12, 15, 22, 33, 55, 22]
number          [1, 2, 3, 4, 5, 6]
Name: ages, dtype: object
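
Note that in the last example each dictionary value is a whole list, so the Series has only two elements and each element is a list. If the intent is one column per key, a DataFrame (introduced later in this section) is usually the better fit. A minimal sketch of the difference:

```python
import pandas as pd

ages = {"age": [12, 15, 22, 33, 55, 22], "number": [1, 2, 3, 4, 5, 6]}

# One element per key; every element holds an entire list
ages_series = pd.Series(ages, name="ages")
print(len(ages_series))

# One column per key; every list item becomes its own row
ages_df = pd.DataFrame(ages)
print(ages_df.shape)
```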

#### A Series from a numpy array

In [15]:
import numpy as np

In [16]:
years = np.array([2019,2020,2021,2022,2023])

years_series = pd.Series(years, index=[1,2,3,4,5], name="years" )

In [17]:
years_series.name

'years'

In [18]:
years_series

1    2019
2    2020
3    2021
4    2022
5    2023
Name: years, dtype: int32

#### A Series from a scalar

In [19]:
scalar_series =pd.Series(12, index=list(range(1,6)))
scalar_series

1    12
2    12
3    12
4    12
5    12
dtype: int64

### Series Indexing and Slicing

In [20]:
countries = ["Ghana", "Nigeria", "Togo","Benin", "Niger"]
countries_series =pd.Series(countries, name= "countries", index= list(range(1,6)))

In [21]:
countries_series

1      Ghana
2    Nigeria
3       Togo
4      Benin
5      Niger
Name: countries, dtype: object

In [22]:
countries_series[1]

'Ghana'

In [23]:
countries_series[5]

'Niger'

In [24]:
countries_series[2:4]

3     Togo
4    Benin
Name: countries, dtype: object

In [25]:
countries_series[3:5]

4    Benin
5    Niger
Name: countries, dtype: object

In [26]:
countries_series[3:]

4    Benin
5    Niger
Name: countries, dtype: object
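
The plain slices above select by position even though the index holds integer labels, at least in the pandas version that produced these outputs: `[2:4]` returned the elements at positions 2 and 3, whose labels are 3 and 4. To make the intent explicit, one option is to use the positional `iloc` and label-based `loc` indexers described later in this section. A minimal sketch:

```python
import pandas as pd

countries_series = pd.Series(
    ["Ghana", "Nigeria", "Togo", "Benin", "Niger"],
    name="countries",
    index=list(range(1, 6)),
)

# Positional slice: positions 2 and 3 (the end position is excluded)
by_position = countries_series.iloc[2:4]
print(by_position.tolist())

# Label slice: labels 2, 3 and 4 (the end label is included)
by_label = countries_series.loc[2:4]
print(by_label.tolist())
```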

The `get` method retrieves a value from a Series by its index label.

It returns `None` if the specified label is not found in the Series.

In [27]:
#Using the get method for indexing
countries_series.get(4)

'Benin'

In [28]:
countries_series.get(2)

'Nigeria'

In [29]:
# Returns None because the label 6 is not in the index
countries_series.get(6)
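
`get` also accepts a `default` argument that is returned in place of `None` when the label is missing. A short sketch, where the fallback string "Unknown" is an arbitrary choice for illustration:

```python
import pandas as pd

countries_series = pd.Series(
    ["Ghana", "Nigeria", "Togo", "Benin", "Niger"],
    name="countries",
    index=list(range(1, 6)),
)

# Label 6 does not exist, so get returns None
missing = countries_series.get(6)
print(missing is None)

# A fallback value can be supplied instead
fallback = countries_series.get(6, default="Unknown")
print(fallback)
```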

#### iloc method

`iloc` selects elements of a Series by integer position.

It takes an integer position, starting from 0 for the first element, and returns the element at that position regardless of the index labels.

It also accepts a slice to select a range of elements.

In [30]:
countries_series

1      Ghana
2    Nigeria
3       Togo
4      Benin
5      Niger
Name: countries, dtype: object

In [31]:
countries_series.iloc[1]

'Nigeria'

In [32]:
# This line raises an IndexError because position 5 is out of bounds
# countries_series.iloc[5]
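
The sketch below shows the out-of-bounds case being caught, along with a negative position and a positional slice. The variable names (`out_of_bounds`, `last`, `first_three`) are introduced here only for illustration.

```python
import pandas as pd

countries_series = pd.Series(
    ["Ghana", "Nigeria", "Togo", "Benin", "Niger"],
    name="countries",
    index=list(range(1, 6)),
)

# Valid positions run from 0 to 4, so position 5 raises an IndexError
out_of_bounds = False
try:
    countries_series.iloc[5]
except IndexError as error:
    out_of_bounds = True
    print("IndexError:", error)

# Negative positions count from the end
last = countries_series.iloc[-1]
print(last)

# A slice selects a range of positions; the end position is excluded
first_three = countries_series.iloc[0:3]
print(first_three.tolist())
```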

#### loc method

The loc method is used to select data from a Series using label-based indexing, i.e. elements are selected by the label of their index rather than by their position.

In [33]:
countries_series.loc[1]

'Ghana'


In [34]:
countries_series

1      Ghana
2    Nigeria
3       Togo
4      Benin
5      Niger
Name: countries, dtype: object

In [35]:
countries_series[1]

'Ghana'

In [36]:
#Renaming the index of the series
countries_series.index = ["a","b","c","d","e"]
countries_series

a      Ghana
b    Nigeria
c       Togo
d      Benin
e      Niger
Name: countries, dtype: object

In [37]:
countries_series.loc["a"]

'Ghana'

In [38]:
countries_series.loc["d"]

'Benin'

In [39]:
countries_series.loc["a":"d"]

a      Ghana
b    Nigeria
c       Togo
d      Benin
Name: countries, dtype: object

In [40]:
countries_series.loc["c":"e"]

c     Togo
d    Benin
e    Niger
Name: countries, dtype: object

In [41]:
# select multiple values using loc
countries_series.loc[["a","c","e"]]

a    Ghana
c     Togo
e    Niger
Name: countries, dtype: object
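
Unlike `get`, `loc` raises a `KeyError` when a label is not in the index. A minimal sketch of handling that case, using the hypothetical missing label "z":

```python
import pandas as pd

countries_series = pd.Series(
    ["Ghana", "Nigeria", "Togo", "Benin", "Niger"],
    name="countries",
    index=["a", "b", "c", "d", "e"],
)

# Asking loc for a label that does not exist raises a KeyError
label_missing = False
try:
    countries_series.loc["z"]
except KeyError:
    label_missing = True
    print("label 'z' is not in the index")

# The `in` operator checks the index labels, not the values
print("a" in countries_series)
print("Ghana" in countries_series)
```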

### DataFrame

A DataFrame is a 2D table-like data structure used to store and manipulate tabular data.

It consists of rows and columns, where each column can have a different data type.

DataFrames can be created from Series, lists, numpy arrays and dictionaries using the form below:

`pandas.DataFrame( data, index, columns, dtype, copy,...)`
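
A small sketch of how the `index`, `columns` and `dtype` arguments fit together. The array values and the labels `r1`..`r3`, `x`, `y` are made up for this example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

frame = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],  # row labels
    columns=["x", "y"],        # column names
    dtype="float64",           # convert every value to float
)
print(frame)
```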


In [42]:
# DataFrame from Dictionary of Series
data= {'Name' : pd.Series(['Selasi', 'Frank', 'Precious','Richmond'], index=[1,2,3,4]),
       
   'Rate' : pd.Series(['4', '3', '5', '2'], index=[1,2,3,4])}

rating_df = pd.DataFrame(data)
rating_df

       Name Rate
1    Selasi    4
2     Frank    3
3  Precious    5
4  Richmond    2
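
The `Rate` values in this example are strings ('4', '3', ...), not numbers, so numeric operations such as a mean would not work on them as they are. One way to convert them, sketched here with `astype`:

```python
import pandas as pd

data = {
    "Name": pd.Series(["Selasi", "Frank", "Precious", "Richmond"], index=[1, 2, 3, 4]),
    "Rate": pd.Series(["4", "3", "5", "2"], index=[1, 2, 3, 4]),
}
rating_df = pd.DataFrame(data)

# Each rate is stored as a Python string at this point
first_rate = rating_df["Rate"].iloc[0]
print(type(first_rate))

# Convert the column to integers so that arithmetic works
rating_df["Rate"] = rating_df["Rate"].astype(int)
average_rate = rating_df["Rate"].mean()
print(average_rate)
```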


In [43]:
# DataFrame from Dictionary
# The keys of the dictionary become the columns of the DataFrame
students = {'name': ['Jude', 'Bob', 'Jael'], 'age': [40, 30, 35], 'gender': ['M', 'M', 'F']}
students_df = pd.DataFrame(students)
students_df

   name  age gender
0  Jude   40      M
1   Bob   30      M
2  Jael   35      F


In [44]:
#DataFrame from list of Lists
data = [['Alice', 25, 'F'], ['Mary', 30, 'F'], ['Sedem', 35, 'M']]
data_df = pd.DataFrame(data,columns=['name', 'age', 'gender'])
data_df

    name  age gender
0  Alice   25      F
1   Mary   30      F
2  Sedem   35      M


#### Indexing and Slicing

In [45]:
data_df

    name  age gender
0  Alice   25      F
1   Mary   30      F
2  Sedem   35      M


In [46]:
#Selecting a single column
age_col = data_df["age"]
age_col

0    25
1    30
2    35
Name: age, dtype: int64

In [47]:
# the type of age_col is a Series
type(age_col)

pandas.core.series.Series

In [48]:
# Selecting a single column with double brackets returns a DataFrame
age_col2 = data_df[["age"]]
age_col2

   age
0   25
1   30
2   35


In [49]:
# the type of age_col2 is a DataFrame
type(age_col2)

pandas.core.frame.DataFrame

In [50]:
data_df

    name  age gender
0  Alice   25      F
1   Mary   30      F
2  Sedem   35      M


In [51]:
#Selecting multiple columns

data_df[["name", "age"]]

    name  age
0  Alice   25
1   Mary   30
2  Sedem   35


### loc method

The loc method is used to select rows and columns from a DataFrame based on the row index labels and the column names.

In [52]:
data_df.loc[1,"name"]

'Mary'

In [53]:
# Selecting all rows for a single column
data_df.loc[:,"name"]

0    Alice
1     Mary
2    Sedem
Name: name, dtype: object

In [54]:
# Selecting all columns for a single row
data_df.loc[1,:]

name      Mary
age         30
gender       F
Name: 1, dtype: object

In [55]:
# Selecting all rows for multiple columns
data_df.loc[:,["name","age"]]

    name  age
0  Alice   25
1   Mary   30
2  Sedem   35


In [56]:
# Select a subset of rows and columns
data_df.loc[1:3,["name","age"]]

    name  age
1   Mary   30
2  Sedem   35
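
In the last cell the slice `1:3` returns only the rows labelled 1 and 2, because no row is labelled 3. Label slices with `loc` include the end label when it exists, whereas positional slices with `iloc` exclude the end position. A minimal sketch comparing the two on the same data:

```python
import pandas as pd

data = [["Alice", 25, "F"], ["Mary", 30, "F"], ["Sedem", 35, "M"]]
data_df = pd.DataFrame(data, columns=["name", "age", "gender"])

# loc: labels 1 and 2 are both included
with_loc = data_df.loc[1:2, ["name", "age"]]
print(with_loc)

# iloc: only position 1 is included, position 2 is excluded
with_iloc = data_df.iloc[1:2, 0:2]
print(with_iloc)
```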


### iloc method

The iloc method is used to select rows and columns from a DataFrame by integer position.

In [57]:
data_df

    name  age gender
0  Alice   25      F
1   Mary   30      F
2  Sedem   35      M


In [58]:
#Accessing a single row
data_df.iloc[0]

name      Alice
age          25
gender        F
Name: 0, dtype: object

In [59]:
data_df.iloc[0,1]

25

In [60]:
data_df.iloc[:,1]

0    25
1    30
2    35
Name: age, dtype: int64

In [64]:
data_df.iloc[0:2,0:1]

    name
0  Alice
1   Mary


### Basic Statistics Using Pandas

In [65]:
my_data = pd.DataFrame({'price': [200, 100, 50, 300, 600], 'profit': [20, 60, 75, 32, 20],'City':['Accra','Tema','Kumasi','Suyani','Koforidua']})
my_data

   price  profit       City
0    200      20      Accra
1    100      60       Tema
2     50      75     Kumasi
3    300      32     Suyani
4    600      20  Koforidua


In [66]:
#Sum of profit
my_data["profit"].sum()

207

In [67]:
#Minimum of Profit
my_data["profit"].min()

20

In [68]:
#Maximum of Profit
my_data["profit"].max()

75

In [70]:
#Sum of profit and sum of price
my_data[["profit","price"]].sum()

profit     207
price     1250
dtype: int64

In [71]:
#Mean of profit and mean of price
my_data[["profit","price"]].mean()

profit     41.4
price     250.0
dtype: float64

In [72]:
#Standard deviation of profit and standard deviation of price
my_data[["profit","price"]].std()

profit     24.895783
price     217.944947
dtype: float64
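
The value returned by `std` is the sample standard deviation: pandas divides by n - 1 by default (`ddof=1`). The sketch below reproduces the profit figure by hand and shows the population version for comparison. The variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

profit = pd.Series([20, 60, 75, 32, 20], name="profit")

# Default: sample standard deviation (divides by n - 1)
sample_std = profit.std()

# Same calculation written out by hand
squared_deviations = (profit - profit.mean()) ** 2
manual_std = np.sqrt(squared_deviations.sum() / (len(profit) - 1))

# Population standard deviation (divides by n)
population_std = profit.std(ddof=0)

print(round(sample_std, 6))
print(round(manual_std, 6))
print(round(population_std, 6))
```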

In [73]:
#Brief statistics of the dataframe
my_data.describe()

            price     profit
count    5.000000   5.000000
mean   250.000000  41.400000
std    217.944947  24.895783
min     50.000000  20.000000
25%    100.000000  20.000000
50%    200.000000  32.000000
75%    300.000000  60.000000
max    600.000000  75.000000


In [74]:
#Brief info of the dataframe
my_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   price   5 non-null      int64 
 1   profit  5 non-null      int64 
 2   City    5 non-null      object
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


In [76]:
#The number of non-null values in each column
my_data.count()

price     5
profit    5
City      5
dtype: int64
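
Because `my_data` has no missing values, `count` returns the number of rows for every column. The difference only shows up when values are missing. A minimal sketch with a made-up `sales` table that contains gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
sales = pd.DataFrame({
    "price": [200, np.nan, 50],
    "city": ["Accra", "Tema", None],
})

# count skips missing values, so it can be smaller than the number of rows
print(sales.count())
print(len(sales))
```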