# Introduction


Pandas and NumPy are two of the most popular Python libraries for data manipulation and analysis. Pandas provides a wide range of tools for working with structured data, while NumPy is focused on scientific computing and numerical analysis. In this notebook, we will review some key features and functions of both libraries.

## Importing Libraries
Before we get started, let's import the necessary libraries:

In [1]:
import numpy as np
import pandas as pd

---
## NumPy
NumPy is a Python library for working with large, multi-dimensional arrays and matrices. It provides a variety of useful functions for performing mathematical operations on these arrays. Let's review some key features of NumPy:

### Creating Arrays
NumPy arrays can be created in several ways. The most common way is to use the **`array()`** function:

In [2]:
# Create a one-dimensional array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

[1 2 3 4 5]


In [3]:
#Creating a multi-dimensional array from a list of lists
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [4]:
#Creating an array of zeros or ones
zeros = np.zeros((3, 4)) #shape/size of the array
print(zeros)
print('---------------')
ones = np.ones((2, 5))
print(ones)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
---------------
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]


In [5]:
#Creating an array with a range of values
arr3 = np.arange(5, 15, 2)
print(arr3)

[ 5  7  9 11 13]


In [6]:
# Creating an array with random values drawn uniformly from [0, 1)
# np.random.seed(24)  # uncomment to make the output reproducible
rand1 = np.random.rand(3, 4)
print(rand1)

[[0.63960876 0.50075754 0.39173056 0.90445507]
 [0.51990922 0.8498987  0.81228718 0.45526993]
 [0.70013009 0.93396012 0.14195667 0.78748512]]


### Array Attributes
Once we have created a NumPy array, we can access various attributes of the array using the following syntax:
* **`arr.shape`**: tuple giving the length of each dimension, e.g. (number of rows, number of columns) for a two-dimensional array.
* **`arr.ndim`**: number of dimensions of the array.
* **`arr.size`**: total number of elements in the array.
* **`arr.dtype`**: data type of the elements in the array.
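
As a small illustration of the four attributes listed above, the following sketch inspects a 2-by-3 array of floats. The variable name `matrix` is only chosen for this example.

```python
import numpy as np

# A two-dimensional array with 2 rows and 3 columns
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(matrix.shape)  # length of each dimension: (2, 3)
print(matrix.ndim)   # number of dimensions: 2
print(matrix.size)   # total number of elements: 6
print(matrix.dtype)  # element type: float64
```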

### Array Operations
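NumPy provides a wide range of mathematical operations on arrays. The arithmetic operators act element by element on arrays of the same shape, while dedicated functions cover matrix algebra and common mathematical functions.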

#### Element-wise operations

In [7]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = arr1 + arr2     # element-wise addition
arr4 = arr1 - arr2     # element-wise subtraction
arr5 = arr1 * arr2     # element-wise multiplication
arr6 = arr1 / arr2     # element-wise division
arr7 = arr1 ** arr2    # element-wise exponentiation

print('arr3: ', arr3)
print('arr4: ', arr4)
print('arr5: ', arr5)
print('arr6: ', arr6)
print('arr7: ', arr7)

arr3:  [5 7 9]
arr4:  [-3 -3 -3]
arr5:  [ 4 10 18]
arr6:  [0.25 0.4  0.5 ]
arr7:  [  1  32 729]


#### Matrix operations

In [8]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr3 = np.dot(arr1, arr2)     # matrix multiplication 
arr4 = np.transpose(arr1)    # matrix transpose
print('arr3: \n', arr3)
print('arr4: \n', arr4)

arr3: 
 [[19 22]
 [43 50]]
arr4: 
 [[1 3]
 [2 4]]
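
For two-dimensional arrays, the same results can also be written with the `@` operator and the `.T` attribute. The sketch below reuses the values from the cell above under the hypothetical names `a` and `b`.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

product = a @ b     # matrix multiplication, same result as np.dot(a, b) for 2-D arrays
transposed = a.T    # transpose, same result as np.transpose(a)

print(product)
print(transposed)
```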


#### Mathematical functions

In [9]:
arr1 = np.array([1, 2, 3])
arr2 = np.sin(arr1)       # sine function
arr3 = np.cos(arr1)       # cosine function
arr4 = np.exp(arr1)       # exponential function
arr5 = np.sqrt(arr1)      # square root function
arr6 = np.log(arr1)       # natural logarithm function
arr7 = np.sum(arr1)       # sum of all elements in the array
arr8 = np.mean(arr1)      # mean of all elements in the array
arr9 = np.std(arr1)       # standard deviation of all elements in the array

print('sin: ', arr2)
print('cos: ', arr3)
print('exp: ', arr4)
print('square root: ', arr5)
print('log: ', arr6)
print('sum: ', arr7)
print('mean: ', arr8)
print('std: ', arr9)

sin:  [0.84147098 0.90929743 0.14112001]
cos:  [ 0.54030231 -0.41614684 -0.9899925 ]
exp:  [ 2.71828183  7.3890561  20.08553692]
square root:  [1.         1.41421356 1.73205081]
log:  [0.         0.69314718 1.09861229]
sum:  6
mean:  2.0
std:  0.816496580927726


### Indexing and Slicing
One-dimensional NumPy arrays can be indexed and sliced with the same syntax as Python lists: indices start at zero, and a `start:stop` slice excludes `stop`.

In [10]:
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])   
print(arr[1:3])

1
[2 3]
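
Arrays with more than one dimension take one index or slice per dimension, separated by commas. A minimal sketch, using a hypothetical 3-by-3 array named `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

element = grid[1, 2]      # single element at row 1, column 2
first_row = grid[0]       # the whole first row
second_col = grid[:, 1]   # every row, column 1
block = grid[:2, 1:]      # first two rows, last two columns

print(element)
print(first_row)
print(second_col)
print(block)
```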


### Array Shape and Size
NumPy arrays have several attributes that describe their shape and size:

In [11]:
arr = np.array([[1, 2, 3], 
                [4, 5, 6]])
print(arr.shape)
print(arr.size) 

(2, 3)
6


---
## Pandas
Pandas is a Python library for data manipulation and analysis. It provides a wide range of functions for working with structured data, including reading and writing data from various file formats, cleaning and transforming data, and performing statistical analysis. Let's review some key features of Pandas:

### Series
A Pandas Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, or Python objects. It is similar to a column in a spreadsheet or a SQL table.

In [14]:
s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(s)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


We can also specify custom labels for the index of the Series:

In [15]:
s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
print(s)

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64
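
Once a Series has custom labels, values can be looked up either by label or by position. The following sketch uses a shorter, made-up Series to show both styles.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']          # look up by index label
by_position = s.iloc[0]    # look up by integer position
subset = s[['a', 'c']]     # select several labels at once

print(by_label)
print(by_position)
print(subset)
```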


### DataFrame
A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.

The following example creates a DataFrame from a Python dictionary:

In [16]:
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'city': ['New York', 'Paris', 'London', 'Berlin']}

df = pd.DataFrame(data)
df

Unnamed: 0,name,age,city
0,Alice,25,New York
1,Bob,32,Paris
2,Charlie,18,London
3,David,47,Berlin


In [17]:
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
df

Unnamed: 0,name,age,city
a,Alice,25,New York
b,Bob,32,Paris
c,Charlie,18,London
d,David,47,Berlin


### Reading Data
Pandas can read data from various file formats such as CSV, Excel, SQL databases, and more.

In [18]:
df = pd.read_csv('drinks.csv')
df.head() 

Unnamed: 0,country,Soft_Drinks_servings,Grape_Juice_servings,Apple_Juice_servings,Orange_Juice_servings,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
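
The file `drinks.csv` is not bundled with this text, so here is a self-contained sketch that feeds `read_csv` an in-memory string instead. It reuses the first three rows shown above. The names `csv_text` and `sample` are only for this example.

```python
import io
import pandas as pd

# First three rows of the output above, written out as CSV text
csv_text = """country,Soft_Drinks_servings,Grape_Juice_servings,Apple_Juice_servings,Orange_Juice_servings,continent
Afghanistan,0,0,0,0.0,AS
Albania,89,132,54,4.9,EU
Algeria,25,0,14,0.7,AF
"""

# read_csv accepts any file-like object, not only a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows
print(sample.dtypes)    # column types inferred by pandas
```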


### Data Cleaning

#### Handling Missing Values
Missing values are a common problem in data analysis. Pandas provides several methods to handle missing values, including:

* **`isna()`** / **`isnull()`**: return a boolean mask marking missing values (`isnull()` is an alias of `isna()`).
* **`fillna()`**: fills in missing values with a specified value or method.
* **`dropna()`**: drops rows or columns with missing values.

In [17]:
# Create a sample dataset
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', np.nan],
        'age': [25, np.nan, 35, np.nan, 29, 21],
        'gender': ['F', 'M', np.nan, 'M', 'F', 'F']}
df = pd.DataFrame(data)
df

Unnamed: 0,name,age,gender
0,Alice,25.0,F
1,Bob,,M
2,Charlie,35.0,
3,David,,M
4,Emily,29.0,F
5,,21.0,F


In [18]:
# Check for missing values
df.isna() #df.isnull()

Unnamed: 0,name,age,gender
0,False,False,False
1,False,True,False
2,False,False,True
3,False,True,False
4,False,False,False
5,True,False,False


In [19]:
df.isna().sum()  #.sort_values()

name      1
age       2
gender    1
dtype: int64

In [20]:
# Fill in missing values by carrying the last valid value forward (forward fill).
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the direct equivalent.
df['age'] = df['age'].ffill()
df['age']

0    25.0
1    25.0
2    35.0
3    35.0
4    29.0
5    21.0
Name: age, dtype: float64

In [21]:
# Fill in missing values by carrying the next valid value backward (backward fill).
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the direct equivalent.
df['gender'] = df['gender'].bfill()
df['gender']

0    F
1    M
2    M
3    M
4    F
5    F
Name: gender, dtype: object
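
Besides forward and backward fill, `fillna()` also accepts a fixed replacement value. The sketch below, assuming a standalone Series named `ages` with the same values as the `age` column above, fills the gaps with a constant and with the median.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25.0, np.nan, 35.0, np.nan, 29.0, 21.0])

filled_const = ages.fillna(0)                # replace missing values with a constant
filled_median = ages.fillna(ages.median())   # replace missing values with the median of the rest

print(filled_const)
print(filled_median)
```

Which strategy is appropriate depends on the data; a constant is simple, while the median keeps the filled values inside the observed range.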

In [22]:
# Drop rows or columns with missing values
df.dropna(inplace=True)  # how='any' (default) drops rows with at least one missing value; how='all' drops only rows that are entirely missing

In [23]:
df

Unnamed: 0,name,age,gender
0,Alice,25.0,F
1,Bob,25.0,M
2,Charlie,35.0,M
3,David,35.0,M
4,Emily,29.0,F
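
To make the effect of the `how` argument concrete, the following sketch compares `how='any'` with `how='all'` and also shows the `subset` argument on a small, made-up DataFrame named `people`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', np.nan, np.nan],
    'age': [25.0, 30.0, np.nan],
})

any_missing = people.dropna(how='any')     # drop rows with at least one missing value
all_missing = people.dropna(how='all')     # drop rows where every value is missing
need_age = people.dropna(subset=['age'])   # only consider the 'age' column

print(len(any_missing), len(all_missing), len(need_age))
```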


#### Handling Duplicates
Duplicates can skew data analysis and lead to incorrect conclusions. Pandas provides several methods to handle duplicates, including:

* **`duplicated()`**: returns a boolean mask indicating duplicate rows.
* **`drop_duplicates()`**: drops duplicate rows.

In [19]:
# Create a sample dataset with duplicate rows
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Alice'],
        'age': [25, 30, 35, 40, 29, 25],
        'gender': ['F', 'M', 'M', 'M', 'F', 'F']}
df = pd.DataFrame(data)
df

Unnamed: 0,name,age,gender
0,Alice,25,F
1,Bob,30,M
2,Charlie,35,M
3,David,40,M
4,Emily,29,F
5,Alice,25,F


In [20]:
print(df.duplicated())

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool


In [21]:
df.duplicated().sum()

1

In [22]:
# Drop duplicate rows
df.drop_duplicates(inplace=True)  # keep='first' (default), 'last', or False controls which duplicates are retained

In [23]:
df.duplicated().sum()

0
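
The `keep` argument can be combined with `subset` to decide which columns define a duplicate. A minimal sketch on a made-up DataFrame named `people`, where two rows share a name but differ in age:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice'],
    'age': [25, 30, 26],
})

# Rows count as duplicates when their 'name' matches, regardless of 'age'
by_name_first = people.drop_duplicates(subset='name', keep='first')  # keep the first Alice
by_name_last = people.drop_duplicates(subset='name', keep='last')    # keep the last Alice
none_kept = people.drop_duplicates(subset='name', keep=False)        # drop every duplicated name

print(by_name_first)
print(by_name_last)
print(none_kept)
```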

### Data Transformation

#### Renaming Columns
Pandas provides a simple method to rename columns of a data frame, which is **`rename()`**. Here's how to use it:

In [28]:
# Create a sample dataset
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [25, 30, 35, 40, 29],
        'gender': ['F', 'M', 'M', 'M', 'F']}
df = pd.DataFrame(data)
df

Unnamed: 0,name,age,gender
0,Alice,25,F
1,Bob,30,M
2,Charlie,35,M
3,David,40,M
4,Emily,29,F


In [29]:
# Rename columns
df.rename(columns={'name': 'Name', 'age': 'Age', 'gender': 'Gender'}, inplace=True)
df

Unnamed: 0,Name,Age,Gender
0,Alice,25,F
1,Bob,30,M
2,Charlie,35,M
3,David,40,M
4,Emily,29,F


#### Sorting Data
Sorting makes rankings and extreme values easy to inspect. The **`sort_values()`** method sorts a DataFrame by one or more columns, with a separate sort direction for each. Here's how to use it:

In [30]:
# Create a sample dataframe
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'age': [25, 30, 20, 35, 27],
    'salary': [50000, 70000, 40000, 90000, 60000]}

df = pd.DataFrame(data)
df

Unnamed: 0,name,age,salary
0,Alice,25,50000
1,Bob,30,70000
2,Charlie,20,40000
3,David,35,90000
4,Emma,27,60000


In [31]:
# Sorting data
df.sort_values(by=['age', 'salary'], ascending=[False, True], inplace=True)
df

Unnamed: 0,name,age,salary
3,David,35,90000
1,Bob,30,70000
4,Emma,27,60000
0,Alice,25,50000
2,Charlie,20,40000
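
Note that the sorted output above keeps the original index labels (3, 1, 4, 0, 2). As a sketch of an alternative, `sort_values()` without `inplace=True` returns a new DataFrame, and `ignore_index=True` renumbers the rows. The names `staff` and `by_age` are only for this example.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                      'age': [25, 30, 20]})

# Returns a sorted copy with a fresh 0..n-1 index; 'staff' itself is unchanged
by_age = staff.sort_values(by='age', ignore_index=True)

print(by_age)
print(staff)
```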


#### Filtering Data
Filtering selects the subset of rows that satisfy a condition and leaves out the rest.

Pandas provides several ways to filter data, including boolean indexing and the `loc` and `iloc` indexers. Note that `loc` and `iloc` are used with square brackets, not called like functions.
* **boolean indexing**: use a comparison to build a boolean mask, then pass the mask to the DataFrame to keep only the rows where it is `True`.
* **`loc[]`**: selects rows (and optionally columns) by label or by boolean array.
* **`iloc[]`**: selects rows (and optionally columns) by integer position.

In [24]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ed', 'Frank', 'Grace', 'Helen'],
        'Age': [25, 32, 18, 47, 23, 37, 29, 31],
        'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'F'],
        'Salary': [50000, 75000, 30000, 90000, 35000, 100000, 60000, 65000]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25,F,50000
1,Bob,32,M,75000
2,Charlie,18,F,30000
3,Dave,47,M,90000
4,Ed,23,F,35000
5,Frank,37,M,100000
6,Grace,29,F,60000
7,Helen,31,F,65000


##### boolean indexing
Suppose we want to filter the data to only include rows where the salary is greater than or equal to 60,000. We can create a boolean mask using the following:

In [25]:
boolean_mask = df['Salary'] >= 60000
print(boolean_mask)

0    False
1     True
2    False
3     True
4    False
5     True
6     True
7     True
Name: Salary, dtype: bool


We can use this boolean mask to filter the data using the following code:

In [26]:
filtered_data = df[boolean_mask]
filtered_data

Unnamed: 0,Name,Age,Gender,Salary
1,Bob,32,M,75000
3,Dave,47,M,90000
5,Frank,37,M,100000
6,Grace,29,F,60000
7,Helen,31,F,65000


##### loc
Suppose we want to filter the data to only include rows where the Age column is greater than 30.

In [35]:
df_filtered = df.loc[df['Age'] > 30]
df_filtered

Unnamed: 0,Name,Age,Gender,Salary
1,Bob,32,M,75000
3,Dave,47,M,90000
5,Frank,37,M,100000
7,Helen,31,F,65000


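We can also combine multiple conditions using the **`&`** (and) and **`|`** (or) operators. Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as `>` and `==`: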

In [36]:
df.loc[(df['Age'] > 30) & (df['Gender'] == 'M')]

Unnamed: 0,Name,Age,Gender,Salary
1,Bob,32,M,75000
3,Dave,47,M,90000
5,Frank,37,M,100000
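
The `|` operator and the negation operator `~` follow the same pattern. The following sketch uses a shortened, hypothetical copy of the data named `staff`.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
    'Age': [25, 32, 18, 47],
    'Gender': ['F', 'M', 'F', 'M'],
})

# Rows where either condition holds
young_or_male = staff.loc[(staff['Age'] < 20) | (staff['Gender'] == 'M')]

# Rows where the condition does not hold
not_male = staff.loc[~(staff['Gender'] == 'M')]

print(young_or_male)
print(not_male)
```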


##### iloc
Suppose we want to select the first three rows.

In [37]:
df_filtered = df.iloc[:3]
df_filtered

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25,F,50000
1,Bob,32,M,75000
2,Charlie,18,F,30000


Suppose we want to select the first three rows and only the first two columns.

In [38]:
df_filtered = df.iloc[:3, :2]
df_filtered

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,32
2,Charlie,18
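
The difference between `loc` and `iloc` is easiest to see when the index is not the default 0, 1, 2, and so on. In this sketch, which assumes a small DataFrame with made-up string labels, the label slice includes its end point while the positional slice does not.

```python
import pandas as pd

staff = pd.DataFrame(
    {'Age': [25, 32, 18], 'Salary': [50000, 75000, 30000]},
    index=['alice', 'bob', 'charlie'],
)

by_label = staff.loc['alice':'bob', 'Age']   # label slice: 'bob' is included
by_position = staff.iloc[0:2, 0]             # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```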


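#### Grouping Data
The **`groupby()`** method splits the rows into groups that share a value in the given column. An aggregation such as **`mean()`** is then computed separately for each group:
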
In [39]:
grouped = df.groupby('Gender')
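grouped.mean(numeric_only=True)  # average only the numeric columns (Age, Salary); the Name column cannot be averaged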

Gender,Age,Salary
F,25.2,48000.0
M,38.666667,88333.333333
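
Different columns often need different aggregations. One way to express that is named aggregation with `agg()`, sketched here on a shortened, hypothetical copy of the data. The output column names `mean_age`, `max_salary`, and `headcount` are chosen only for this example.

```python
import pandas as pd

staff = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Age': [25, 32, 18, 47],
    'Salary': [50000, 75000, 30000, 90000],
})

# Each keyword is an output column: (input column, aggregation function)
summary = staff.groupby('Gender').agg(
    mean_age=('Age', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Age', 'count'),
)

print(summary)
```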
