### Pandas

Pandas is a popular Python library that provides a wide range of functionality for data manipulation and data analysis.

It is widely used in data analytics, data science, and machine learning, among other fields.

Pandas has two primary data structures, known as **Series** and **DataFrame**.
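
As a quick illustration of the two structures, the sketch below builds one of each. The variable names and values are made up for this example.

```python
import pandas as pd

# A Series is a single labelled column of values (illustrative data).
temperatures = pd.Series([30, 28, 31], name="temperature")

# A DataFrame is a table made up of one or more labelled columns.
weather = pd.DataFrame({"city": ["Accra", "Kumasi", "Tamale"],
                        "temperature": [30, 28, 31]})

print(type(temperatures).__name__)  # Series
print(type(weather).__name__)       # DataFrame
print(weather.shape)                # (3, 2): three rows, two columns
```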

### Installation of Pandas

The pandas library can be installed using either of the following:


**1. Command Prompt**

- Launch the command prompt
- Type the code below to install the pandas library

`pip install pandas`



**2. Anaconda Terminal**

- Launch the anaconda terminal
- Type the code below to install the pandas library

`conda install pandas`


### Using Pandas

To use the pandas library, import it using:

`import pandas as pd`

**pd** is the conventional alias for the pandas library.

In [None]:
import pandas as pd

### Series

A Series is a one-dimensional labelled array that can hold data of any type, such as integers, floats, or strings.

A pandas Series can be created from a list, a NumPy array, a dictionary, a scalar, or data loaded from a file or database.

The general form for creating a Series is:

`pd.Series(data)`

#### A Series from a list

In [None]:
num_1 = [1,2,3,4,5]
num_1_series = pd.Series(num_1,name="numero")


In [None]:
num_1_series

In [None]:
type(num_1_series)

In [None]:
num_1_series.shape

In [None]:
num_1_series.name

In [None]:
num_1_series.index

In [None]:
len(num_1_series)

In [None]:
#Renaming the index of num_1_series
num_1_series.index = ["a","b","c","d","e"]


In [None]:
num_1_series

In [None]:
num_1_series.index

#### A Series from a dictionary

In [None]:
# The keys of the dictionary serve as the index of the Series.
# The values of the dictionary serve as the values of the Series.

regions_capitals ={"Greater Accra":"Accra","Ashanti":"Kumasi", "Central":"Cape Coast"}

regions_capitals

In [None]:
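# The index holds the regions and the values hold the capitals, so the Series is named after its values
regions_capitals_series = pd.Series(regions_capitals, name="capitals")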
regions_capitals_series

In [None]:
regions_capitals_series.index

In [None]:
regions_capitals_series.values

In [None]:
# Each dictionary value (here a whole list) becomes a single element of the Series
ages = {"age": [12, 15, 22, 33, 55, 22], "number": [1, 2, 3, 4, 5, 6]}
ages_series = pd.Series(ages, name="ages")
ages_series

In [None]:
ages_series.index

In [None]:
ages_series.values

#### A Series from a numpy array

In [None]:
import numpy as np
import pandas as pd

In [None]:
years = np.array([2019,2020,2021,2022,2023])

In [None]:
years_series = pd.Series(years, index=[1,2,3,4,5], name="years" )

In [None]:
years_series.name

In [None]:
years_series.index

In [None]:
years_series

#### A Series from a scalar

In [None]:
scalar_series =pd.Series(12, index=list(range(1,9)))


In [None]:
scalar_series.index

In [None]:
scalar_series

### Series Indexing and Slicing

In [None]:
countries = ["Ghana", "Nigeria", "Togo","Benin", "Niger"]
countries_series =pd.Series(countries, name= "countries", index= list(range(1,6)))

In [None]:
countries_series

In [None]:
countries_series[1]

In [None]:
countries_series[5]

In [None]:
countries_series[2:4]

In [None]:
countries_series[3:5]

In [None]:
countries_series[3:]

The `get` method retrieves a value from a Series based on its index label.

The `get` method returns None if the specified label is not found in the Series.
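
A minimal sketch of this behaviour, using a small made-up Series (`countries_demo` is a name chosen for this example). It also shows the optional second argument of `get`, which replaces None as the fallback value.

```python
import pandas as pd

# Illustrative Series with integer labels 1 to 3
countries_demo = pd.Series(["Ghana", "Nigeria", "Togo"], index=[1, 2, 3])

found = countries_demo.get(2)                 # label 2 exists
missing = countries_demo.get(10)              # label 10 is absent, so None is returned
fallback = countries_demo.get(10, "Unknown")  # a custom default instead of None

print(found, missing, fallback)
```

Unlike square-bracket indexing, `get` does not raise a `KeyError` for a missing label, which makes it convenient when a label may or may not be present.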

In [None]:
#Using the get method for indexing
countries_series.get(4)

In [None]:
countries_series.get(2)

In [None]:
# Returns None because label 6 is not in the index
countries_series.get(6)

#### iloc method

The iloc method (used with square brackets) selects elements of a Series by integer position.

It takes an integer position as its argument, starting from 0 for the first element, and returns the element at that position.

It also accepts a slice, which selects a range of elements in the Series.
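
The sketch below illustrates positional access with a made-up Series whose labels (10, 20, ...) are deliberately different from the positions, so it is clear that `iloc` ignores the labels.

```python
import pandas as pd

# Illustrative Series: labels are 10..50, positions are 0..4
letters = pd.Series(["a", "b", "c", "d", "e"], index=[10, 20, 30, 40, 50])

first = letters.iloc[0]     # element at position 0
last = letters.iloc[-1]     # negative positions count from the end
middle = letters.iloc[1:4]  # positions 1, 2 and 3 (the end position is excluded)

print(first, last)
print(middle)
```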

In [None]:
countries_series

In [None]:
countries_series.iloc[2]

In [None]:
# Position 4 is the last element, so this returns "Niger"; a position of 5 or more would raise an IndexError
countries_series.iloc[4]

#### loc method

The loc method selects data from a Series using label-based indexing, i.e. elements are selected by the labels of the index rather than by their positions. When slicing with loc, both the start label and the end label are included.


In [None]:
countries_series

In [None]:
countries_series.loc[1]

In [None]:
countries_series.iloc[1]

In [None]:
countries_series[1]

In [None]:
#Renaming the index of the series
countries_series.index = ["a","b","c","d","e"]
countries_series

In [None]:
countries_series.loc["a"]

In [None]:
countries_series.loc["d"]

In [None]:
countries_series.loc["a":"d"]

In [None]:
countries_series.loc["c":"e"]

In [None]:
# select multiple values using loc
countries_series.loc[["a","c","e"]]
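
To make the difference between label-based and position-based slicing concrete, here is a small sketch on a made-up Series (`towns` is a name chosen for this example).

```python
import pandas as pd

# Illustrative Series with string labels
towns = pd.Series(["Accra", "Kumasi", "Cape Coast", "Ho"], index=["a", "b", "c", "d"])

by_label = towns.loc["a":"c"]   # label slice: the end label "c" is included
by_position = towns.iloc[0:2]   # position slice: the end position 2 is excluded

print(by_label)
print(by_position)
```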

### DataFrame

A DataFrame is a two-dimensional, table-like data structure used to store and manipulate tabular data.

It consists of rows and columns, where each column can have a different data type.

DataFrames can be created from Series, lists, NumPy arrays and dictionaries using the format below:

`pandas.DataFrame(data, index, columns, dtype, copy,...)`
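
As a rough illustration of the `data`, `index`, `columns` and `dtype` parameters, the sketch below builds a DataFrame from a NumPy array. The scores and labels are invented for this example.

```python
import numpy as np
import pandas as pd

# Illustrative 3x2 array of exam scores
scores = np.array([[80, 75], [62, 90], [71, 68]])

scores_df = pd.DataFrame(
    scores,
    index=["s1", "s2", "s3"],      # row labels
    columns=["maths", "english"],  # column names
    dtype="float64",               # convert every column to floats
)

print(scores_df)
print(scores_df.dtypes)
```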


In [None]:
# DataFrame from Dictionary of Series
data_1= {'Name' : pd.Series(['Selasi', 'Frank', 'Precious','Richmond'], index=[1,2,3,4]),
       
   'Rate' : pd.Series(['4', '3', '5', '2'], index=[1,2,3,4])}

data_1

In [None]:
rating_df = pd.DataFrame(data_1)
rating_df

In [None]:
# DataFrame from Dictionary
# The keys of the dictionary become the columns of the DataFrame
students = {'name': ['Jude', 'Bob', 'Jael'], 'age': [40, 30, 35], 'gender': ['M', 'M', 'F']}
students_df = pd.DataFrame(students)
students_df

In [None]:
#DataFrame from list of Lists
data = [['Alice', 25, 'F'], ['Mary', 30, 'F'], ['Sedem', 35, 'M']]
data_df = pd.DataFrame(data,columns=['name', 'age', 'gender'])
data_df

#### Indexing and Slicing

In [None]:
data_df

In [None]:
data_df.shape

In [None]:
type(data_df)

In [None]:
data_df.ndim

In [None]:
data_df.index

In [None]:
data_columns = data_df.columns
data_columns

In [None]:
data_df

In [None]:
#Selecting a single column
age_col = data_df["age"]
age_col

In [None]:
# the type of age_col is a Series
type(age_col)

In [None]:
# Selecting a single column with double brackets returns a DataFrame
age_col2 = data_df[["age"]]
age_col2

In [None]:
# the type of age_col2 is a DataFrame
type(age_col2)

In [None]:
data_df

In [None]:
#Selecting multiple columns

data_df[["name","age"]]

### loc method

The loc method selects rows and columns from a DataFrame by label, i.e. by the row index labels and the column names.

In [None]:
data_df.loc[2,"name"]

In [None]:
# Selecting all rows for a single column
data_df.loc[:,"name"]

In [None]:
# Selecting all columns for a single row
data_df.loc[1,:]

In [None]:
# Select all rows for multiple columns
data_df.loc[:,["name","age"]]

In [None]:
# Select a subset of rows and columns
# (label slices include the end label; label 3 does not exist here, so rows 1 and 2 are returned)
data_df.loc[1:3,["name","age"]]

### iloc

The iloc method selects data from a DataFrame by integer position, counting rows and columns from 0.

In [None]:
data_df

In [None]:
#Accessing a single row
data_df.iloc[0]

In [None]:
# Accessing a single value: first row, second column
data_df.iloc[0,1]

In [None]:
# Accessing all rows of the second column
data_df.iloc[:,1]

In [None]:
data_df.iloc[0:2,0:1]
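
The type of result returned by iloc depends on whether integers or slices are used. The sketch below rebuilds the same three-row table under a new, illustrative name (`people`) so that it runs on its own.

```python
import pandas as pd

# Same data as data_df above, rebuilt so the example is self-contained
people = pd.DataFrame(
    [["Alice", 25, "F"], ["Mary", 30, "F"], ["Sedem", 35, "M"]],
    columns=["name", "age", "gender"],
)

as_frame = people.iloc[0:2, 0:1]  # slice for rows and columns: a 2x1 DataFrame
as_series = people.iloc[0:2, 0]   # slice for rows, integer for column: a Series
cell = people.iloc[0, 1]          # integers for both: a single value

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, len(as_series))
print(cell)
```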

### Basic Statistics Using Pandas

In [None]:
my_data = pd.DataFrame({'price': [200, 100, 50, 300, 600], 'profit': [20, 60, 75, 32, 20],'City':['Accra','Tema','Kumasi','Sunyani','Koforidua']})
my_data

In [None]:
#Sum of profit
my_data["profit"].sum()

In [None]:
#Minimum of Profit
my_data["profit"].min()

In [None]:
#Maximum of Profit
my_data["profit"].max()

In [None]:
#Sum of profit and sum of price
my_data[["profit","price"]].sum()

In [None]:
#Mean of profit and mean of price
my_data[["profit","price"]].mean()

In [None]:
#Standard deviation of profit and standard deviation of price
my_data[["profit","price"]].std()

In [None]:
#Brief statistics of the dataframe
my_data.describe()

In [None]:
#Brief info of the dataframe
my_data.info()

In [None]:
#the number of non null-values
my_data.count()

### Read CSV Files  

A simple way to store big data sets is to use CSV files (comma-separated values files).

CSV files contain plain text and use a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.
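
If 'data.csv' is not at hand, `read_csv` can be tried out on text held in memory. The sketch below is only an illustration: it wraps a few rows, copied from the output shown later in this section, in `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io
import pandas as pd

# A tiny stand-in for data.csv (first three rows of the data set shown below)
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # the column types are inferred from the text
```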

In [1]:
# Load the CSV into a DataFrame

import pandas as pd

df = pd.read_csv("data.csv")

#print(df.to_string()) 

### Viewing the Data  

One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

In [7]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))


   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [11]:
print(df.tail(20)) 

     Duration  Pulse  Maxpulse  Calories
149        60    110       150     409.4
150        60    106       134     343.0
151        60    109       129     353.2
152        60    109       138     374.0
153        30    150       167     275.8
154        60    105       128     328.0
155        60    111       151     368.5
156        60     97       131     270.4
157        60    100       120     270.4
158        60    114       150     382.8
159        30     80       120     240.9
160        30     85       120     250.4
161        45     90       130     260.4
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
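
When no number is passed, both head() and tail() return 5 rows. A minimal sketch on a made-up 20-row DataFrame (`numbers_df` is a name chosen for this example):

```python
import pandas as pd

# Illustrative DataFrame with the numbers 1 to 20
numbers_df = pd.DataFrame({"n": range(1, 21)})

top_default = numbers_df.head()     # first 5 rows by default
bottom_default = numbers_df.tail()  # last 5 rows by default
top_three = numbers_df.head(3)      # an explicit number of rows

print(top_default)
print(bottom_default)
print(top_three)
```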


### Info About the Data  

The DataFrame object has a method called info() that gives you more information about the data set.

In [12]:
print(df.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


### Null Values  

The info() method also tells us how many non-null values are present in each column. In our data set, 164 of the 169 values in the "Calories" column are non-null.

This means that 5 rows have no value at all in the "Calories" column, for whatever reason.

Empty values, or null values, can distort the results when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.
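
As a small preview, and only as a sketch on made-up data, missing values can be counted with `isna().sum()` and the affected rows removed with `dropna()`. The `workouts` DataFrame below is an invented stand-in for the real data set.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing Calories values
workouts = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, np.nan, 195.1, np.nan],
})

missing_per_column = workouts.isna().sum()  # number of null values in each column
cleaned = workouts.dropna()                 # new DataFrame without rows containing nulls

print(missing_per_column)
print(cleaned)
```

Note that `dropna()` returns a new DataFrame and leaves the original unchanged.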

