# Pandas 
## Introduction
Pandas provides data structures and data manipulation tools designed to make data cleaning and analysis in Python quick and straightforward. It is frequently used in conjunction with numerical computing libraries such as NumPy and SciPy, analytical libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. This Jupyter notebook covers the fundamentals of what the Pandas library can do for you.

## Contents
<ol>
    <li>Series</li>
    <li>Dataframes</li>
</ol>

## Imports 

In [1]:
import pandas as pd 

## 1. Series 
A Pandas Series is a one-dimensional labeled array that can hold data of any type (integer, float, string, etc.).

In [2]:
# Create a Series object
serie= pd.Series([4, 3, 5, 7])
serie

0    4
1    3
2    5
3    7
dtype: int64
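
The `dtype` line in the output reflects the kind of values stored. As a small sketch (the sample values below are made up for illustration), the inferred dtype changes with the data passed in:

```python
import pandas as pd

# Illustrative values only; pandas infers the dtype from the data
ints = pd.Series([4, 3, 5, 7])      # integers -> int64
floats = pd.Series([1.5, 2.0])      # floats -> float64
mixed = pd.Series([1, 'a', 2.5])    # mixed Python objects -> object

print(ints.dtype, floats.dtype, mixed.dtype)
```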

In [3]:
# It is a set of indexed values
# Values
print('Values:', serie.values)

# indexes
print('Indexes:', serie.index)

Values: [4 3 5 7]
Indexes: RangeIndex(start=0, stop=4, step=1)


In the previous example, we passed in the data [4, 3, 5, 7] and the index defaulted to a RangeIndex running from 0 to 3. We can also set the index labels explicitly:

In [4]:
series1= pd.Series([4, 3, 5, 7], index=['A', 'B', 'C', 'D'])
series1

A    4
B    3
C    5
D    7
dtype: int64

In [5]:
# Extract a value by index
series1['A']

4

In [6]:
# Set a value by index
series1['A']= 10
series1

A    10
B     3
C     5
D     7
dtype: int64

In [7]:
# Extract a set of values by their indexes
series1[['A', 'D']]

A    10
D     7
dtype: int64

In [8]:
# Multiply values by a number 
series1 * 2

A    20
B     6
C    10
D    14
dtype: int64

In [9]:
# Check if an index exists in the Series object 
print('A' in series1)
print('G' in series1)

True
False


In [10]:
# Initialise a Series object using a dictionary 
# Define a dictionary 
data= {'Volvo': 3, 'Mercedes': 6, 'Renault': 67}

# Create the Series object
series2= pd.Series(data)

# Show series content 
series2

Volvo        3
Mercedes     6
Renault     67
dtype: int64

In [11]:
# Check whether a Series contains NaN values using the isna() method (isnull() is an alias)
# isna() returns True where the value is NaN, otherwise False.  
series2.isna()

Volvo       False
Mercedes    False
Renault     False
dtype: bool
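
The Series above has no missing values, so every entry is `False`. As a minimal sketch of the opposite case, assume a hypothetical label `'Toyota'` that does not appear in the dictionary; passing it in `index` makes pandas fill that position with NaN:

```python
import pandas as pd

data = {'Volvo': 3, 'Mercedes': 6, 'Renault': 67}

# 'Toyota' is a hypothetical label that is absent from the dictionary,
# so its value becomes NaN (and the Series is upcast to float64)
series3 = pd.Series(data, index=['Volvo', 'Mercedes', 'Toyota'])

missing = series3.isna()         # True only for 'Toyota'
n_missing = int(missing.sum())   # number of missing values
cleaned = series3.dropna()       # copy of the Series without the NaN entry

print(missing)
print('Missing values:', n_missing)
```

Note that `'Renault'` is dropped as well, because only the labels listed in `index` are kept.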

In [12]:
# Filtering pandas Series
# Keep the values that are greater than 2
series2[series2> 2]

Volvo        3
Mercedes     6
Renault     67
dtype: int64

In [13]:
# Filter by a condition: keep only the even values
series2[series2 % 2 == 0]

Mercedes    6
dtype: int64

## 2. Dataframes
A dataframe is a two-dimensional, table-like data structure used to store and modify data in a structured way. It is a fundamental data structure in the pandas library and is frequently used in data analysis and manipulation. Let's start by creating one:

### 1. General

In [14]:
# Create a dataframe with three columns: Player, nation, number of World Cup trophies 
data= {'player': ['Ronaldo', 'Messi', 'Pele', 'Maradonna', 'Ronaldinho', 'Beckham'],
       'nation': ['Portugal', 'Argentina', 'Brazil', 'Argentina', 'Brazil', 'England'],
       'world cup': [0, 1, 3, 1, 1, 0]}

df= pd.DataFrame(data)

# display dataframe
df

| | player | nation | world cup |
|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 |
| 1 | Messi | Argentina | 1 |
| 2 | Pele | Brazil | 3 |
| 3 | Maradonna | Argentina | 1 |
| 4 | Ronaldinho | Brazil | 1 |
| 5 | Beckham | England | 0 |


In [15]:
# columns 
df.columns

Index(['player', 'nation', 'world cup'], dtype='object')

In [16]:
# To display the first 5 rows
df.head()

| | player | nation | world cup |
|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 |
| 1 | Messi | Argentina | 1 |
| 2 | Pele | Brazil | 3 |
| 3 | Maradonna | Argentina | 1 |
| 4 | Ronaldinho | Brazil | 1 |


In [17]:
# To display the last 5 rows
df.tail()

| | player | nation | world cup |
|---|---|---|---|
| 1 | Messi | Argentina | 1 |
| 2 | Pele | Brazil | 3 |
| 3 | Maradonna | Argentina | 1 |
| 4 | Ronaldinho | Brazil | 1 |
| 5 | Beckham | England | 0 |


In [18]:
# shape of dataframe 
df.shape # 6 rows and 3 columns

(6, 3)

In [19]:
# length of dataframe or number of rows
len(df)

6

In [20]:
# Retrieve a dataframe column as a Series object
df['player'] # or df.player

0       Ronaldo
1         Messi
2          Pele
3     Maradonna
4    Ronaldinho
5       Beckham
Name: player, dtype: object
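
Selecting with a single label returns a Series, as above. A short sketch (using a reduced copy of the table, named `df_demo` here only for illustration) shows that passing a list of labels returns a dataframe instead:

```python
import pandas as pd

# Reduced copy of the players table, for illustration only
df_demo = pd.DataFrame({'player': ['Ronaldo', 'Messi', 'Pele'],
                        'nation': ['Portugal', 'Argentina', 'Brazil'],
                        'world cup': [0, 1, 3]})

one_col = df_demo['player']                    # single label -> Series
two_cols = df_demo[['player', 'world cup']]    # list of labels -> DataFrame

print(type(one_col).__name__, type(two_cols).__name__)
```

The bracket form is also the only option for `'world cup'`: because the name contains a space, the attribute shortcut (`df.player` style) cannot be written for it.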

### 2. Filtering and selection with loc and iloc

In [21]:
# Filter by conditions using loc
# Single condition
df.loc[df['world cup'] > 1]

| | player | nation | world cup |
|---|---|---|---|
| 2 | Pele | Brazil | 3 |


In [22]:
# two conditions or more 
# And operator: &
# Or operator: | 
df.loc[ (df['world cup'] >= 1) & (df['nation'] == 'Argentina') ]

| | player | nation | world cup |
|---|---|---|---|
| 1 | Messi | Argentina | 1 |
| 3 | Maradonna | Argentina | 1 |
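
The cell above only exercises `&`. The sketch below rebuilds the same table under the illustrative name `df_demo` and shows the `|` operator, along with the `~` (not) operator and `isin()`, which are not covered above:

```python
import pandas as pd

# Same data as the players table, rebuilt so the example is self-contained
df_demo = pd.DataFrame({
    'player': ['Ronaldo', 'Messi', 'Pele', 'Maradonna', 'Ronaldinho', 'Beckham'],
    'nation': ['Portugal', 'Argentina', 'Brazil', 'Argentina', 'Brazil', 'England'],
    'world cup': [0, 1, 3, 1, 1, 0]})

# Or: players from Brazil, or players without a World Cup trophy
brazil_or_none = df_demo.loc[(df_demo['nation'] == 'Brazil') | (df_demo['world cup'] == 0)]

# Not: players who are not from Argentina
not_argentina = df_demo.loc[~(df_demo['nation'] == 'Argentina')]

# Membership test: players whose nation is in a given list
selected = df_demo.loc[df_demo['nation'].isin(['Brazil', 'England'])]

print(brazil_or_none)
```

Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>=`.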


In [23]:
# Add a new column to the dataframe
goals= list(range(6))
df['goals']= goals
df

| | player | nation | world cup | goals |
|---|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 | 0 |
| 1 | Messi | Argentina | 1 | 1 |
| 2 | Pele | Brazil | 3 | 2 |
| 3 | Maradonna | Argentina | 1 | 3 |
| 4 | Ronaldinho | Brazil | 1 | 4 |
| 5 | Beckham | England | 0 | 5 |


In [24]:
# Dropping columns 
df.drop(columns=['goals'], inplace= True)
df

| | player | nation | world cup |
|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 |
| 1 | Messi | Argentina | 1 |
| 2 | Pele | Brazil | 3 |
| 3 | Maradonna | Argentina | 1 |
| 4 | Ronaldinho | Brazil | 1 |
| 5 | Beckham | England | 0 |


In [25]:
# Flag duplicated rows (True marks a row that repeats an earlier one)
df.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [64]:
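# Count duplicated vs. non-duplicated rows
# (not displayed here: a notebook cell only shows its last expression)
df.duplicated().value_counts()

# Count rows by their pattern of NaN values (all False means no missing values)
df.isna().value_counts()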

player  nation  world cup
False   False   False        6
dtype: int64
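
The players table contains no repeated rows, so `duplicated()` returns only `False`. A minimal sketch with a deliberately repeated row (the small `df_dup` table is made up for illustration) shows what a positive result looks like and how `drop_duplicates()` removes it:

```python
import pandas as pd

# Made-up table in which the third row repeats the first one
df_dup = pd.DataFrame({'player': ['Messi', 'Pele', 'Messi'],
                       'nation': ['Argentina', 'Brazil', 'Argentina']})

flags = df_dup.duplicated()          # True for rows that repeat an earlier row
n_dup = int(flags.sum())             # number of duplicated rows
deduped = df_dup.drop_duplicates()   # keeps the first occurrence of each row

print(flags)
print('Duplicated rows:', n_dup)
```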

In [26]:
# Retrieve rows and columns using iloc
# retrieve the 3rd row
df.iloc[2]

player         Pele
nation       Brazil
world cup         3
Name: 2, dtype: object

In [27]:
# retrieve first 4 rows
df.iloc[0:4]

| | player | nation | world cup |
|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 |
| 1 | Messi | Argentina | 1 |
| 2 | Pele | Brazil | 3 |
| 3 | Maradonna | Argentina | 1 |


In [28]:
# retrieve a value iloc[row, column]
df.iloc[1,1]

'Argentina'
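
With the default integer index, labels and positions coincide, which can hide the difference between `loc` (label-based) and `iloc` (position-based). The sketch below assumes a small table with string labels `'a'`, `'b'`, `'c'`, chosen only to make the distinction visible:

```python
import pandas as pd

# Illustrative table with string row labels instead of the default 0, 1, 2
df_demo = pd.DataFrame({'player': ['Ronaldo', 'Messi', 'Pele'],
                        'world cup': [0, 1, 3]},
                       index=['a', 'b', 'c'])

by_position = df_demo.iloc[1, 0]          # second row, first column
by_label = df_demo.loc['b', 'player']     # row labelled 'b', column 'player'

pos_slice = df_demo.iloc[0:2]             # positions 0 and 1 (end excluded)
label_slice = df_demo.loc['a':'c']        # labels 'a' through 'c' (end included)

print(by_position, by_label)
```

Both lookups return the same cell, but the slices differ in length: `iloc` follows Python's end-exclusive convention, while `loc` includes the end label.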

### 3. Operations between a series and a dataframe

In [29]:
rankings= pd.Series([2, 1, 10, 5, 4, 8])
df['rankings']= rankings
df

| | player | nation | world cup | rankings |
|---|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 | 2 |
| 1 | Messi | Argentina | 1 | 1 |
| 2 | Pele | Brazil | 3 | 10 |
| 3 | Maradonna | Argentina | 1 | 5 |
| 4 | Ronaldinho | Brazil | 1 | 4 |
| 5 | Beckham | England | 0 | 8 |
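
The assignment above works row by row because the Series and the dataframe share the same default index. Pandas matches on index labels, not on position, which the following sketch illustrates with a hypothetical `scores` Series whose labels are out of order and incomplete:

```python
import pandas as pd

df_demo = pd.DataFrame({'player': ['Ronaldo', 'Messi', 'Pele']})

# Hypothetical Series: labels are in a different order and label 1 is missing
scores = pd.Series([30, 10], index=[2, 0])

# Values are matched to rows by label; rows without a match get NaN
df_demo['score'] = scores

print(df_demo)
```

Row 0 receives 10, row 2 receives 30, and row 1 is left as NaN, so the column ends up as floats.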


In [30]:
# Multiply two columns 
df['rankings']= df['world cup'] * df['rankings']
df

| | player | nation | world cup | rankings |
|---|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 | 0 |
| 1 | Messi | Argentina | 1 | 1 |
| 2 | Pele | Brazil | 3 | 30 |
| 3 | Maradonna | Argentina | 1 | 5 |
| 4 | Ronaldinho | Brazil | 1 | 4 |
| 5 | Beckham | England | 0 | 0 |


In [31]:
# Divide the rankings column by the world cup column (0 / 0 yields NaN)
df['ratio']= df['rankings'] / df['world cup']
df

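| | player | nation | world cup | rankings | ratio |
|---|---|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 | 0 | NaN |
| 1 | Messi | Argentina | 1 | 1 | 1.0 |
| 2 | Pele | Brazil | 3 | 30 | 10.0 |
| 3 | Maradonna | Argentina | 1 | 5 | 5.0 |
| 4 | Ronaldinho | Brazil | 1 | 4 | 4.0 |
| 5 | Beckham | England | 0 | 0 | NaN |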
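
The missing ratios for Ronaldo and Beckham come from dividing 0 by 0. The sketch below uses two made-up Series to show the two cases that can occur when the divisor is zero, and one possible clean-up; replacing with 0 is an assumption made for illustration, and the right fill value depends on the analysis:

```python
import numpy as np
import pandas as pd

# Made-up numerators and denominators covering both zero-division cases
num = pd.Series([0, 5, 4])
den = pd.Series([0, 0, 2])

ratio = num / den    # 0/0 gives NaN, 5/0 gives inf, 4/2 gives 2.0

# One possible clean-up: turn infinities into NaN, then fill NaN with 0
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

print(ratio)
print(safe_ratio)
```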


In [34]:
# Sum two columns
df['sum']= df['rankings'] + df['world cup']
df

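| | player | nation | world cup | rankings | ratio | sum |
|---|---|---|---|---|---|---|
| 0 | Ronaldo | Portugal | 0 | 0 | NaN | 0 |
| 1 | Messi | Argentina | 1 | 1 | 1.0 | 2 |
| 2 | Pele | Brazil | 3 | 30 | 10.0 | 33 |
| 3 | Maradonna | Argentina | 1 | 5 | 5.0 | 6 |
| 4 | Ronaldinho | Brazil | 1 | 4 | 4.0 | 5 |
| 5 | Beckham | England | 0 | 0 | NaN | 0 |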


### 4. Sorting and ranking

In [38]:
# sort a dataframe by a column 
df = pd.DataFrame({'name': ['John', 'Amy', 'Bob', 'Alex'],
                   'age': [25, 28, 22, 30],
                   'salary': [50000, 60000, 45000, 70000]})


df1= df.sort_values(by= 'salary', ascending= False)
df1

| | name | age | salary |
|---|---|---|---|
| 3 | Alex | 30 | 70000 |
| 1 | Amy | 28 | 60000 |
| 0 | John | 25 | 50000 |
| 2 | Bob | 22 | 45000 |
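
`sort_values` also accepts a list of columns. The sketch below adds a hypothetical `dept` column (not part of the original data) to sort by department first and then by descending salary within each department:

```python
import pandas as pd

# 'dept' is a hypothetical column added only for this illustration
df_demo = pd.DataFrame({'name': ['John', 'Amy', 'Bob', 'Alex'],
                        'dept': ['IT', 'HR', 'IT', 'HR'],
                        'salary': [50000, 60000, 45000, 70000]})

# Sort by dept (ascending), then by salary (descending) within each dept
sorted_df = df_demo.sort_values(by=['dept', 'salary'], ascending=[True, False])

# The original row labels travel with the rows; reset them if a clean 0..n-1 index is wanted
sorted_reset = sorted_df.reset_index(drop=True)

print(sorted_df)
```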


In [46]:
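# rank the values within every column of the dataframe (not displayed: only the last expression is shown)
df.rank()

# rank rows based on one column
df['rank']= df['salary'].rank(method= 'min', ascending= False) # method can be 'average' (default), 'min', 'max', 'first' or 'dense'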
df

| | name | age | salary | rank |
|---|---|---|---|---|
| 0 | John | 25 | 50000 | 3.0 |
| 1 | Amy | 28 | 60000 | 2.0 |
| 2 | Bob | 22 | 45000 | 4.0 |
| 3 | Alex | 30 | 70000 | 1.0 |
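
Since every salary in the table is distinct, the `method` argument makes no visible difference there. It only matters for ties, as this sketch with a made-up Series containing two equal salaries suggests:

```python
import pandas as pd

# Made-up salaries with a tie between the second and third entries
salaries = pd.Series([50000, 60000, 60000, 70000])

rank_avg = salaries.rank(method='average', ascending=False)  # tied values share the mean rank
rank_min = salaries.rank(method='min', ascending=False)      # tied values get the lowest rank
rank_max = salaries.rank(method='max', ascending=False)      # tied values get the highest rank
rank_dense = salaries.rank(method='dense', ascending=False)  # like 'min', but with no gaps after ties

print(pd.DataFrame({'average': rank_avg, 'min': rank_min,
                    'max': rank_max, 'dense': rank_dense}))
```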


### 5. Statistically describing a dataframe 

#### A. General overview

In [53]:
df.describe()

| | age | salary | rank |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 26.25 | 56250.0 | 2.5 |
| std | 3.5 | 11086.778913 | 1.290994 |
| min | 22.0 | 45000.0 | 1.0 |
| 25% | 24.25 | 48750.0 | 1.75 |
| 50% | 26.5 | 55000.0 | 2.5 |
| 75% | 28.5 | 62500.0 | 3.25 |
| max | 30.0 | 70000.0 | 4.0 |


#### B. By column

In [54]:
# Mean
df['salary'].mean()

56250.0

In [55]:
# Median
df['salary'].median()

55000.0

In [56]:
# standard deviation: measures how much the values of a column vary around the mean
df['salary'].std()

11086.778913041726

In [57]:
# variance: quantifies the dispersion of a column (the square of the standard deviation)
df['salary'].var()

122916666.66666667
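
By default, pandas computes the sample versions of both statistics (`ddof=1`, dividing by n - 1). A minimal sketch using the same four salaries checks that the variance is the square of the standard deviation and shows the population variant obtained with `ddof=0`:

```python
import math
import pandas as pd

salary = pd.Series([50000, 60000, 45000, 70000])

sample_std = salary.std()       # default ddof=1: divides by n - 1
sample_var = salary.var()       # default ddof=1: divides by n - 1
pop_var = salary.var(ddof=0)    # population variance: divides by n

# The variance is the square of the standard deviation
print(math.isclose(sample_std ** 2, sample_var))
print(sample_var, pop_var)
```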

In [58]:
# minimum
df['salary'].min()

45000

In [59]:
# maximum
df['salary'].max()

70000

In [60]:
# sum all values of a column
df['salary'].sum()

225000