# PART 1 Introduction to Pandas and NumPy

## Introduction to NumPy

#### What is NumPy?

NumPy is a Python library for numerical computing. It supports large, multidimensional arrays and matrices, and provides a collection of mathematical functions that
operate on these arrays efficiently.

#### Why is NumPy important in data science?

NumPy is essential in data science due to its ability to handle large datasets
efficiently. It provides high-performance multidimensional arrays and tools for
working with these arrays, making it indispensable for tasks like data manipulation,
analysis, and machine learning.
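
As a rough illustration of the efficiency claim, the sketch below computes the same sum of squares twice: once with an element-by-element Python loop and once with a single vectorized NumPy expression. The variable names are illustrative assumptions, and no timings are shown because speedups depend on array size and hardware.

```python
import numpy as np

values = np.arange(1, 1001)  # integers 1..1000

# Pure-Python approach: iterate element by element
loop_total = sum(int(v) ** 2 for v in values)

# Vectorized approach: one expression applied to the whole array
vector_total = int(np.sum(values.astype(np.int64) ** 2))

print(loop_total == vector_total)
```

Both approaches give the same result; the vectorized form pushes the loop into compiled code, which is where NumPy's performance advantage typically comes from.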
    

In [2]:
# Installation and set up
# !pip install numpy

#### NumPy Arrays

In [6]:
#Creating NumPy arrays
import numpy as np

# Create an array from the list
arr1 = np.array([1,2,3,4,5])
print(arr1)

[1 2 3 4 5]


In [5]:
# create an array of zeros
arr2 = np.zeros(5)
print(arr2)

[0. 0. 0. 0. 0.]


In [7]:
# create an array of ones
arr3 = np.ones((3,3))
print(arr3)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [8]:
# Create an array of evenly spaced values
arr4 = np.arange(0,10,2)
print(arr4)

[0 2 4 6 8]


In [9]:
# Create an array of random values
arr5 = np.random.rand(3,3)
print(arr5)

[[0.10263278 0.33229407 0.7180908 ]
 [0.1213187  0.30334688 0.27813358]
 [0.99625893 0.61429774 0.31878184]]


In [15]:
#Array attributes: shape, size, dtype
print(arr1.shape)
print(arr1.size)
print(arr1.dtype)

(5,)
5
int32
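
The `int32` above reflects the machine that produced this output. NumPy's default integer type depends on the platform and NumPy version, so `int64` is equally possible. A small sketch of requesting a dtype explicitly and inspecting `ndim` alongside the other attributes (the names `a` and `m` are illustrative):

```python
import numpy as np

# Request the element type explicitly instead of relying on the platform default
a = np.array([1, 2, 3, 4, 5], dtype=np.int64)
m = np.ones((3, 3), dtype=np.float32)

print(a.dtype, a.ndim, a.shape)  # element type, number of axes, length per axis
print(m.dtype, m.ndim, m.size)   # size is the total element count
```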


In [16]:
#Indexing and slicing arrays
# Accessing
print(arr1[0])

# Slicing
print(arr1[1:4])

1
[2 3 4]
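
The same indexing rules extend to more than one dimension. The sketch below, using a hypothetical `grid` array, shows negative indices, 2D slices, and the fact that basic slices are views onto the original data rather than copies.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

last_row = grid[-1]    # a negative index counts from the end
block = grid[:2, 1:3]  # rows 0-1, columns 1-2

# Basic slices are views: writing through one changes the original array
block[0, 0] = 100
print(grid[0, 1])

# Use .copy() when an independent array is needed
safe = grid[:2, 1:3].copy()
safe[0, 0] = -1
print(grid[0, 1])
```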


In [11]:
# Array operations: arithmetic, aggregation, broadcasting

# Arithmetic operations 
arr6 = arr1 + arr2
print(arr6)

# Aggregation functions
print(np.sum(arr1))

# Broadcasting
arr7 = arr1 * 2
print(arr7)

[1. 2. 3. 4. 5.]
15
[ 2  4  6  8 10]
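
Aggregation functions also accept an `axis` argument, which matters once arrays have more than one dimension. A minimal sketch with an illustrative 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())         # all elements
print(m.sum(axis=0))   # collapse the rows: one total per column
print(m.sum(axis=1))   # collapse the columns: one total per row
print(m.mean(axis=0))  # column means
```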


#### Array Manipulation

In [12]:
# Reshaping arrays

# Reshape array
arr8 = np.arange(9).reshape(3,3)
print(arr8)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
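
Two related details are worth sketching, under the assumption of a 12-element array named `flat`: passing `-1` lets NumPy infer one dimension, and a reshape fails when the total element count cannot be preserved.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size
auto = flat.reshape(3, -1)
print(auto.shape)

# ravel() flattens back to one dimension
print(auto.ravel().shape)

# The total number of elements must be preserved (12 is not divisible by 5)
try:
    flat.reshape(5, -1)
    reshaped_ok = True
except ValueError:
    reshaped_ok = False
print(reshaped_ok)
```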


In [13]:
#Stacking and splitting arrays
#Stacking arrays vertically
arr9 = np.vstack((arr8, arr8))
print(arr9)


[[0 1 2]
 [3 4 5]
 [6 7 8]
 [0 1 2]
 [3 4 5]
 [6 7 8]]


In [14]:
#Splitting arrays
arr10, arr11 = np.split(arr9, 2)
print(arr10, arr11)

[[0 1 2]
 [3 4 5]
 [6 7 8]] [[0 1 2]
 [3 4 5]
 [6 7 8]]
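
As a complement to vertical stacking and even splits, the following sketch shows horizontal stacking and `np.array_split`, which tolerates sections of unequal size. The array names are illustrative.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Horizontal stacking joins arrays side by side (column-wise)
wide = np.hstack((a, a))
print(wide.shape)

# np.split requires an even division; np.array_split allows uneven pieces
parts = np.array_split(np.arange(7), 3)
print([p.tolist() for p in parts])
```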


In [27]:
# Transposing arrays
arr12 = arr8.T
print(arr12)

[[0 3 6]
 [1 4 7]
 [2 5 8]]


In [28]:
#Universal functions (ufuncs) 
arr13 = np.sin(arr1)
print(arr13)

[ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]


### Advanced NumPy

In [30]:
# Fancy indexing: select several elements at once by passing an array of indices
indices = np.array([0,2,4])
print(arr1[indices])

[1 3 5]


In [36]:
# Boolean indexing
bool_arr = arr1 > 4
print(arr1[bool_arr])

[5]
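
Boolean masks can be combined. The sketch below assumes an array equal to `arr1` and shows the element-wise `&` and `|` operators (each condition needs its own parentheses) together with `np.where`.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Combine conditions with & (and) and | (or); parenthesize each condition
middle = arr[(arr > 1) & (arr < 5)]
edges = arr[(arr == 1) | (arr == 5)]

# np.where picks between two values element by element
labels = np.where(arr % 2 == 0, "even", "odd")
print(middle, edges, labels)
```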


In [37]:
# Vectorized operations
arr14 = arr1 + 10
print(arr14)

[11 12 13 14 15]


In [15]:
# Broadcasting
arr15 = arr1 + np.array([[10],[20],[30],[40],[50]])
print(arr15)

[[11 12 13 14 15]
 [21 22 23 24 25]
 [31 32 33 34 35]
 [41 42 43 44 45]
 [51 52 53 54 55]]
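
The result above follows from the broadcasting rule: shapes are compared from the rightmost dimension, and two dimensions are compatible when they are equal or one of them is 1. A sketch with illustrative names follows; `np.broadcast_shapes` is assumed to be available, which requires NumPy 1.20 or later.

```python
import numpy as np

row = np.array([1, 2, 3, 4, 5])              # shape (5,)
col = np.array([10, 20, 30])[:, np.newaxis]  # shape (3, 1)

# Shapes are aligned from the right: (5,) with (3, 1) gives (3, 5)
table = row + col
print(table.shape)

# Report the result shape without doing the arithmetic
print(np.broadcast_shapes(row.shape, col.shape))

# Incompatible shapes raise a ValueError
try:
    np.ones(3) + np.ones(4)
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```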


In [19]:
# Compute the product of two arrays with np.dot
# For 1-D arrays, np.dot returns the dot product: a single number computed from two equal-length sequences.
# For 2-D arrays such as A and B below, it performs matrix multiplication.
A = np.array([[1,2], [3,4] ])
B = np.array([[5,6], [7,8] ])
C = np.dot(A, B)

print(C)

[[19 22]
 [43 50]]
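
To make the distinction concrete, this sketch contrasts `np.dot` on 1-D inputs, matrix multiplication on 2-D inputs (also available as the `@` operator), and element-wise multiplication. The vectors `u` and `v` are illustrative additions.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# 1-D inputs: the classic dot product, a single number (1*4 + 2*5 + 3*6)
scalar = np.dot(u, v)
print(scalar)

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# 2-D inputs: matrix multiplication; the @ operator is equivalent
print(np.array_equal(np.dot(A, B), A @ B))

# Element-wise multiplication is a different operation
elementwise = A * B
print(elementwise)
```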


In [18]:
# compute the inverse matrix
A_inv = np.linalg.inv(A)
print(A_inv)

[[-2.   1. ]
 [ 1.5 -0.5]]
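
One way to check an inverse is to multiply it back and compare against the identity matrix, allowing for floating-point rounding. The sketch also shows what happens with a singular matrix; the `singular` example is an illustrative assumption.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)

# A @ A_inv should equal the identity up to floating-point rounding
print(np.allclose(A @ A_inv, np.eye(2)))

# A singular matrix (determinant 0) has no inverse
singular = np.array([[1, 2], [2, 4]])
try:
    np.linalg.inv(singular)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
print(invertible)
```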


# Part 2: Introduction to Pandas

Pandas is a powerful data manipulation and analysis library built on top of NumPy. It
provides high-level data structures like Series and DataFrame designed to efficiently
handle structured data.

#### Series and DataFrame objects

Pandas has two main data structures:

- Series: a one-dimensional array-like object that can hold any data type.
- DataFrame: a two-dimensional labeled data structure with columns of potentially different data types.

#### Importance

Pandas is essential in data analysis tasks as it simplifies many common data manipulation tasks such as
cleaning, transformation, and exploration. It provides intuitive and flexible tools for working with
structured data.
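
A minimal sketch of the two structures side by side, using made-up sample data (`ages` and `people` are illustrative names):

```python
import numpy as np
import pandas as pd

# A Series is one labeled column of values
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

# A DataFrame is a table whose columns may have different dtypes
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                       "Age": [25, 30, 35]})

print(ages["Bob"])
print(people.dtypes)

# Each DataFrame column is itself a Series, convertible to a NumPy array
print(isinstance(people["Age"], pd.Series))
print(isinstance(people["Age"].to_numpy(), np.ndarray))
```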


In [39]:
# !pip install pandas
import pandas as pd

#### Working with Series

In [None]:
#Creating Series objects
import pandas as pd

# Create Series from a list
s1 = pd.Series([1,2,3,4,5])

# Create Series from an array
s2 = pd.Series(np.array([1,2,3,4,5]))

# Create Series from a dictionary
s3 = pd.Series({'a':1, 'b':2,'c':3,})


In [41]:
#Indexing and slicing Series

# Accessing elements by label
print(s3['a'])

# Accessing elements by position (use .iloc; plain s3[0] is deprecated for positional access in recent pandas versions)
print(s3.iloc[0])

# Slicing
print(s3[:2])

1
1
a    1
b    2
dtype: int64
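
The explicit accessors `.loc` (labels) and `.iloc` (positions) avoid ambiguity between the two kinds of lookup. A short sketch with an illustrative Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.loc["b"])  # label-based
print(s.iloc[1])   # position-based

# Label slices include the end label; position slices exclude the end position
by_label = s.loc["a":"b"].tolist()
by_position = s.iloc[0:1].tolist()
print(by_label, by_position)
```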


In [44]:
#Operations on Series

# Arithmetic operations
s4 = s1 + s2
print(s4)

# Element-wise operations
s5 = s1*2
print(s5)

# Aggregation functions
print(s1.sum())
print(s1.max())

0     2
1     4
2     6
3     8
4    10
dtype: int64
0     2
1     4
2     6
3     8
4    10
dtype: int64
15
5


In [45]:
#Handling missing data

# Drop missing values
s6 = s1.dropna()

# Fill missing values
s7 = s1.fillna(0)

# Check for missing values
print(s1.isnull())

0    False
1    False
2    False
3    False
4    False
dtype: bool
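
Because `s1` contains no missing values, the calls above leave it unchanged. The sketch below uses a hypothetical Series with `NaN` entries so the effect of each method is visible.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

n_missing = int(s.isnull().sum())      # number of missing entries
dropped = s.dropna().tolist()          # entries with NaN removed
zero_filled = s.fillna(0).tolist()     # NaN replaced by a constant
mean_filled = s.fillna(s.mean()).tolist()  # NaN replaced by the mean of observed values

print(n_missing, dropped, zero_filled, mean_filled)
```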


### Working with DataFrames

In [4]:
#Creating DataFrames

# Create DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age':[25,30,35]}
df1 = pd.DataFrame(data)
print(df1)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [2]:
import pandas as pd

# Create DataFrame from a list of lists

data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df2 = pd.DataFrame(data, columns=['Name', 'Age'])
print(df2)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [49]:
# Indexing and slicing DataFrames

# Label-based indexing
print(df1.loc[0, 'Name'])

# Position-based indexing
print(df1.iloc[0,0])

# Slicing
print(df1[:2])

Alice
Alice
    Name  Age
0  Alice   25
1    Bob   30


In [5]:
#Basic operations
# Sorting
df1_sorted = df1.sort_values(by='Age')
print(df1_sorted)
# Filtering
df1_filtered = df1[df1['Age'] > 30]
print(df1_filtered)
# Selecting columns
names = df1['Name']
print(names)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
      Name  Age
2  Charlie   35
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


In [6]:
#Data manipulation
# Adding a column
df1['Gender'] = ['Female', 'Male', 'Male']
# Deleting a column
del df1['Gender']
# Updating a column
df1['Age'] = df1['Age'] + 1

### Data Cleaning and Preparation

In [7]:
#Handling missing values
# Drop missing values
df1_cleaned = df1.dropna()
# Fill missing values
df1_filled = df1.fillna(0)
# Check for missing values
print(df1.isnull().any())

Name    False
Age     False
dtype: bool


In [10]:
#Data normalization (numerical values only)
# Select the numeric columns; arithmetic on the string column 'Name' raises a TypeError
numeric = df1.select_dtypes(include='number')
# Min-Max normalization
df1_normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
# Standardization
df1_standardized = (numeric - numeric.mean()) / numeric.std()

In [None]:
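#Data transformation
# Log transformation
df1['Log_Age'] = np.log(df1['Age'])
# Encoding categorical variables
# 'Gender' was deleted in an earlier cell, so add it back before encoding
df1['Gender'] = ['Female', 'Male', 'Male']
df1_encoded = pd.get_dummies(df1, columns=['Gender'])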

In [None]:
#Data aggregation and grouping
# Group by age and calculate mean age
# (each group's mean equals its own key here; grouping by a categorical column is more typical)
age_groups = df1.groupby('Age')
mean_age = age_groups['Age'].mean()
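
Grouping is usually more informative on a categorical column. The sketch below uses a made-up four-row frame (the extra row and the `Gender` values are assumptions for illustration) to compute per-group statistics.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "Dana"],
                   "Gender": ["Female", "Male", "Male", "Female"],
                   "Age": [25, 30, 35, 45]})

# Mean age per gender
mean_by_gender = df.groupby("Gender")["Age"].mean()
print(mean_by_gender)

# Several statistics at once
summary = df.groupby("Gender")["Age"].agg(["count", "min", "max"])
print(summary)
```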

## Advanced Pandas

In [None]:
## Merging, Joining, and Concatenating DataFrames
#Concatenating DataFrames
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
 'B': ['B3', 'B4', 'B5']})
# Concatenate along rows
result_row = pd.concat([df1, df2])
# Concatenate along columns
result_col = pd.concat([df1, df2], axis=1)
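
One detail the cell above leaves implicit is what happens to the index. This sketch, with smaller illustrative frames, shows that row-wise concatenation keeps the original labels unless `ignore_index=True` is passed.

```python
import pandas as pd

df_a = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df_b = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Default row-wise concat keeps each frame's own index, so labels repeat
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([df_a, df_b], ignore_index=True)
print(renumbered.index.tolist())

# Column-wise concat aligns on the index and places the frames side by side
side_by_side = pd.concat([df_a, df_b], axis=1)
print(side_by_side.shape)
```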

In [None]:
#Merging DataFrames
# Create sample DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
 'value': ['V0', 'V1', 'V2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
 'value': ['V1', 'V2', 'V3']})
# Inner join
inner_join = pd.merge(left, right, on='key', how='inner')
# Left join
left_join = pd.merge(left, right, on='key', how='left')
# Right join
right_join = pd.merge(left, right, on='key', how='right')
# Outer join
outer_join = pd.merge(left, right, on='key', how='outer')
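
To see how the four join types differ, the sketch below rebuilds the same `left` and `right` frames and prints the row count of each result. It also uses the `suffixes` and `indicator` parameters of `pd.merge` to label overlapping columns and record where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "value": ["V0", "V1", "V2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "value": ["V1", "V2", "V3"]})

# Row counts differ by join type
counts = {}
for how in ["inner", "left", "right", "outer"]:
    counts[how] = len(pd.merge(left, right, on="key", how=how))
print(counts)

# Overlapping non-key columns get suffixes; indicator=True adds a _merge column
outer = pd.merge(left, right, on="key", how="outer",
                 suffixes=("_left", "_right"), indicator=True)
print(outer)
```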

In [None]:
#Joining DataFrames
# Create sample DataFrames
left = pd.DataFrame({'value1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'value2': [4, 5, 6]}, index=['a', 'b', 'd'])
# Join based on index
join_df = left.join(right, how='inner')
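
For comparison, this sketch runs the inner join next to the default behavior of `DataFrame.join`, which is a left join on the index.

```python
import pandas as pd

left = pd.DataFrame({"value1": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"value2": [4, 5, 6]}, index=["a", "b", "d"])

# Inner join keeps only index labels present in both frames
inner = left.join(right, how="inner")
print(inner.index.tolist())

# The default (how="left") keeps every row of `left`; unmatched cells become NaN
default = left.join(right)
print(default)
```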


## Part 4: Technical Level Questions and Exercises

In [12]:
###Implementing Matrix Operations: Matrix Multiplication
import numpy as np
# Create two matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

In [None]:
## Question: Multiply matrices A and B.
# Answer
C = np.dot(A, B)
print(C) # Output: [[19 22] [43 50]]

In [None]:
## Question: Matrix Inversion - Find the inverse of matrix A.
# Answer
A_inv = np.linalg.inv(A)
print(A_inv) # Output: [[-2. 1.] [ 1.5 -0.5]]

In [13]:
## Question: Calculate the eigenvalues and eigenvectors of matrix A.
# Answer
eigenvalues, eigenvectors = np.linalg.eig(A)
# np.linalg.eig does not guarantee any particular ordering of the eigenvalues;
# column i of `eigenvectors` corresponds to eigenvalues[i]
print(eigenvalues)
print(eigenvectors)

[-0.37228132  5.37228132]
[[-0.82456484 -0.41597356]
 [ 0.56576746 -0.90937671]]
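
A quick way to validate the result is to check the defining relation A v = λ v for each eigenpair. The sketch below does that and sorts the eigenvalues for cases where a fixed order matters.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Column i of `eigenvectors` pairs with eigenvalues[i]
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(np.allclose(A @ v, eigenvalues[i] * v))

# Sort when a fixed order matters
order = np.argsort(eigenvalues)
sorted_values = eigenvalues[order]
print(sorted_values)
```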


In [14]:
## Question: Solve the system of equations 3x + y = 9 and x + 2y = 8.
# Answer
# Write the system as A @ [x, y] = b and solve it
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)
print(x) # Output: [2. 3.]


[2. 3.]
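
The solution can be verified by substituting it back into the system. This sketch also compares against the inverse-based route; `np.linalg.solve` is generally the preferred choice numerically.

```python
import numpy as np

A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)

# Substituting the solution back should reproduce the right-hand side
print(np.allclose(A @ x, b))

# Same result via the explicit inverse
x_via_inverse = np.linalg.inv(A) @ b
print(np.allclose(x_via_inverse, x))
```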


In [15]:
import numpy as np
# Fancy indexing on a 2D array: define the array first
arr15 = np.array([[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]])
print("Array:")
print(arr15)
# Define row and column indices
row_indices = np.array([1, 2])
column_indices = np.array([0, 1])
# Use a tuple for fancy indexing
selected_elements = arr15[(row_indices, column_indices)]
print("\nSelected Elements:")
print(selected_elements)

Array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Selected Elements:
[4 8]
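
Paired index arrays pick individual elements, as above. When the goal is instead every listed row crossed with every listed column, `np.ix_` is one option. A sketch with an illustrative `grid` array:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
rows = np.array([1, 2])
cols = np.array([0, 1])

# Paired indices: elements at (1, 0) and (2, 1)
paired = grid[rows, cols]
print(paired)

# np.ix_ builds an open mesh: rows 1-2 crossed with columns 0-1
sub_block = grid[np.ix_(rows, cols)]
print(sub_block)
```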


## NumPy Practice Questions:
1. Array Creation: Create a NumPy array of integers from 1 to 10.
2. Array Operations: Perform element-wise addition, subtraction, multiplication, and
division on two NumPy arrays.
3. Array Slicing: Extract the first three elements from a NumPy array.
4. Array Reshaping: Reshape a 1D NumPy array into a 2D array with 2 rows and 3
columns.
5. Array Aggregation: Calculate the mean, median, and standard deviation of a NumPy
array.

## Pandas Practice Questions:
1. DataFrame Creation: Create a pandas DataFrame from a dictionary of lists, where
each list represents a column.
2. DataFrame Operations: Add a new column to an existing DataFrame. Perform
element-wise operations between two columns.
3. Data Selection: Select rows based on a condition. Select specific columns from a
DataFrame.
4. Data Aggregation: Group the data in a DataFrame based on a categorical column and
calculate the mean of each group.
5. Data Cleaning: Handle missing values by either removing or replacing them with
appropriate values.