### Types of Data

1. **Numerical Data (Quantitative)**
   - **Integer (int):** Whole numbers without a fractional part (e.g., 1, 42, -7).
   - **Float (float):** Numbers with a fractional part (e.g., 3.14, -2.7).

2. **Boolean Data**
   - Represents binary values (True/False).

3. **Text Data (String)**
   - Sequence of characters (e.g., names, addresses).

4. **Date and Time Data (Datetime)**
   - Represents dates and times (e.g., 2023-01-01, 14:30:00).

**Data Types in Pandas:**
- **int64:** Integer data type.
- **float64:** Floating-point data type.
- **object:** General data type for strings and mixed types.
- **bool:** Boolean data type.
- **datetime64:** Date and time data type. 

Choosing the appropriate data type is essential for efficient storage and accurate computations in data analysis and machine learning.
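As a rough illustration of how these types show up in practice, the sketch below builds a small DataFrame and inspects the dtype pandas infers for each column. The column names and values are made up for this example. Text columns are reported as `object` in many pandas versions, though newer releases may show a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df_types = pd.DataFrame({
    "count": [1, 42, -7],                                   # whole numbers -> int64
    "price": [3.14, -2.7, 0.5],                             # fractional numbers -> float64
    "active": [True, False, True],                          # True/False -> bool
    "name": ["Alice", "Bob", "Charlie"],                    # text
    "joined": ["2023-01-01", "2023-02-15", "2023-03-30"],   # dates stored as text
})

# Dates given as plain strings are not parsed automatically; convert them explicitly.
df_types["joined"] = pd.to_datetime(df_types["joined"])

print(df_types.dtypes)

# A smaller integer type stores the same values in less memory,
# as long as every value fits in its range (int8 covers -128 to 127).
small_counts = df_types["count"].astype("int8")
print(small_counts.dtype)
```

Picking the smallest type that safely holds the data is one way the choice of dtype affects storage.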

### Pandas Library

**Pandas** is a Python library for loading, cleaning, and analyzing tabular data, which makes it a standard tool for preparing datasets in machine learning and AI.

**Key Features:**
1. **DataFrames and Series:** Provides flexible and powerful data structures for handling structured data.
2. **Data Manipulation:** Offers extensive functions for data cleaning, transformation, and analysis.
3. **Handling Missing Data:** Efficiently manages and fills missing data in datasets.
4. **Integration:** Seamlessly integrates with other libraries like NumPy and Matplotlib.

**Benefits for ML and AI:**
- **Data Preparation:** Simplifies data preprocessing and feature engineering.
- **Convenience:** Provides easy-to-use data structures and functions.
- **Efficiency:** Column-wise (vectorized) operations are typically much faster than looping over rows in plain Python.

Pandas is a foundational tool in machine learning and AI for efficient data preparation and analysis.
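To make the two core data structures and the NumPy integration concrete, here is a minimal sketch. The variable names (`ages`, `frame`, `age_array`) are chosen for this example only.

```python
import numpy as np
import pandas as pd

# A Series is a single labeled column of values.
ages = pd.Series([25, 30, 35], name="Age")

# A DataFrame is a table whose columns are Series sharing one index.
frame = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": ages})

age_column = frame["Age"]           # selecting one column gives back a Series
age_array = age_column.to_numpy()   # the same values as a NumPy array

print(type(age_column).__name__)
print(type(age_array).__name__)
print(np.mean(age_array))           # NumPy functions work on the converted array
```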

In [1]:
!pip install pandas


Defaulting to user installation because normal site-packages is not writeable


In [2]:
import pandas as pd  # Importing pandas for data manipulation and analysis



In [3]:
# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],  # List of names
    "Age": [25, 30, 35],  # List of ages
    "City": ["New York", "Los Angeles", "Chicago"]  # List of cities
}

data  # The dictionary containing the data



{'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [25, 30, 35],
 'City': ['New York', 'Los Angeles', 'Chicago']}

In [4]:
df = pd.DataFrame(data)  # Converting the dictionary to a pandas DataFrame

# Displaying the DataFrame
print("Pandas DataFrame:")  # Print a label for the output
df  # Print the DataFrame


Pandas DataFrame:


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [5]:
# Selecting Columns:
ages = df[["Age"]]  # Double brackets return a one-column DataFrame (single brackets would return a Series)
ages  # Displaying the new DataFrame containing only the 'Age' column



   Age
0   25
1   30
2   35


In [6]:
# Filtering rows based on a condition
age_above_28 = df[df["Age"] > 28]  # The boolean mask keeps rows where 'Age' is greater than 28; the result is already a DataFrame
age_above_28  # Displaying the filtered DataFrame



      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
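Filters can also combine several conditions. The sketch below rebuilds the same three-row table under a new name (`people`, chosen for this example) so it runs on its own. Each condition needs its own parentheses, with `&` for "and" and `|` for "or".

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Rows where BOTH conditions hold.
over_28_in_chicago = people[(people["Age"] > 28) & (people["City"] == "Chicago")]

# Rows where AT LEAST ONE condition holds.
under_28_or_la = people[(people["Age"] < 28) | (people["City"] == "Los Angeles")]

print(over_28_in_chicago)
print(under_28_or_la)
```

Python's plain `and` / `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used here.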


In [7]:
# Adding a new column
df["Occupation"] = ["Engineer", "Doctor", "Artist"]  # Adding an 'Occupation' column to the DataFrame
print("DataFrame with new column:\n")  # Print a label for the output
df  # Displaying the updated DataFrame with the new column


DataFrame with new column:



      Name  Age         City  Occupation
0    Alice   25     New York    Engineer
1      Bob   30  Los Angeles      Doctor
2  Charlie   35      Chicago      Artist


In [8]:
# Modifying an existing column
df["Age"] = df["Age"] + 1  # Incrementing each value in the 'Age' column by 1
print("DataFrame with modified Age:\n")  # Print a label for the output
df  # Displaying the DataFrame with the modified 'Age' column


DataFrame with modified Age:



      Name  Age         City  Occupation
0    Alice   26     New York    Engineer
1      Bob   31  Los Angeles      Doctor
2  Charlie   36      Chicago      Artist


In [9]:
# Creating a DataFrame with missing values
data_with_nan = {
    "Name": ["Alice", "Bob", None],  # 'Name' column with a missing value
    "Age": [25, None, 35],  # 'Age' column with a missing value
    "City": ["New York", "Los Angeles", None]  # 'City' column with a missing value
}
df_nan = pd.DataFrame(data_with_nan)  # Converting the dictionary to a pandas DataFrame
print("DataFrame with missing values:\n")  # Print a label for the output
df_nan  # Displaying the DataFrame with missing values


DataFrame with missing values:



    Name   Age         City
0  Alice  25.0     New York
1    Bob   NaN  Los Angeles
2   None  35.0         None


In [10]:
# Filling missing values
df_filled = df_nan.fillna("Unknown")  # Filling every missing value with "Unknown" (this turns the numeric 'Age' column into object dtype)
print("DataFrame with filled missing values:\n")  # Print a label for the output
df_filled  # Displaying the DataFrame with filled missing values


DataFrame with filled missing values:



      Name      Age         City
0    Alice     25.0     New York
1      Bob  Unknown  Los Angeles
2  Unknown     35.0      Unknown
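Filling a numeric column with text makes it unusable for arithmetic, so a common alternative is to choose a fill value per column. The sketch below is one possible approach, not the only one: it uses the column mean for `Age` and a placeholder string for the text columns. The names `df_missing` and `fill_values` are introduced just for this example.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [25, None, 35],
    "City": ["New York", "Los Angeles", None],
})

# Count the missing values in each column before filling.
print(df_missing.isna().sum())

# One fill value per column: mean() skips missing entries, so it averages 25 and 35.
fill_values = {
    "Name": "Unknown",
    "Age": df_missing["Age"].mean(),
    "City": "Unknown",
}
df_filled_by_column = df_missing.fillna(value=fill_values)

print(df_filled_by_column)
print(df_filled_by_column["Age"].dtype)  # the column stays numeric
```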


In [11]:
import pandas as pd  # Importing pandas for data manipulation and analysis

# Creating a sample DataFrame
data = {
    "Feature1": [10, 20, 30, 40, 50],  # List of values for 'Feature1'
    "Feature2": [5, 15, 25, 35, 45]  # List of values for 'Feature2'
}
df_stats = pd.DataFrame(data)  # Converting the dictionary to a pandas DataFrame

# Displaying the DataFrame
df_stats  # Displaying the sample DataFrame


   Feature1  Feature2
0        10         5
1        20        15
2        30        25
3        40        35
4        50        45


In [12]:
# Calculating standard deviation for each column
std_dev = df_stats.std()  # Calculating the standard deviation for each column in the DataFrame
print("Standard Deviation of each column:\n")  # Print a label for the output
std_dev  # Displaying the standard deviation of each column


Standard Deviation of each column:



Feature1    15.811388
Feature2    15.811388
dtype: float64
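The value 15.811388 is the sample standard deviation: by default, pandas divides by n - 1 (`ddof=1`). The sketch below reproduces that number by hand for the `Feature1` values and shows how `ddof=0` gives the population version instead. The helper variable names are assumptions made for this example.

```python
import pandas as pd

feature = pd.Series([10, 20, 30, 40, 50], name="Feature1")

# Sum of squared deviations from the mean.
# The mean is 30, so this is 400 + 100 + 0 + 100 + 400 = 1000.
squared_deviations = ((feature - feature.mean()) ** 2).sum()

n = len(feature)
sample_std_manual = (squared_deviations / (n - 1)) ** 0.5   # divide by n - 1
population_std_manual = (squared_deviations / n) ** 0.5     # divide by n

print(sample_std_manual, feature.std())            # pandas default, ddof=1
print(population_std_manual, feature.std(ddof=0))  # population standard deviation
```

Which version is appropriate depends on whether the data is treated as a sample or as the whole population.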

### NumPy Library

**NumPy** (Numerical Python) provides fast N-dimensional arrays and the numerical routines that many machine learning and AI libraries build on.

**Key Features:**
1. **Efficient Arrays:** Provides efficient storage and operations for large arrays and matrices.
2. **Mathematical Functions:** Offers a wide range of functions for complex calculations.
3. **Broadcasting:** Allows arithmetic operations on arrays of different shapes without explicit loops.
4. **Performance:** Vectorized operations run in compiled C code, which is crucial for handling large datasets in ML and AI.

**Benefits for ML and AI:**
- **Speed:** Enhances performance for numerical computations.
- **Convenience:** Simplifies the implementation of algorithms.
- **Integration:** Works seamlessly with other libraries like Pandas and TensorFlow.

NumPy is a foundational tool in machine learning and AI for data manipulation and computational efficiency.
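Broadcasting is easiest to see with a small example. In this sketch (array names are chosen for illustration), a 2x3 matrix is combined with a row, a column, and a single number without writing any loops.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

added_row = matrix + row       # the row is applied to every row of the matrix
added_column = matrix + column # the column is applied to every column of the matrix
scaled = matrix * 2            # a single number is applied to every element

print(added_row)
print(added_column)
print(scaled)
```

Shapes are compared from the last axis backwards, and two axes are compatible when they are equal or one of them is 1.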


In [13]:
# Installing the NumPy package for numerical operations and large, multi-dimensional arrays
!pip install numpy

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import numpy as np  # Importing NumPy for numerical operations


In [3]:
# Creating a 1D array
array_1d = np.array([1, 2, 3, 4, 5])  # Creating a 1-dimensional NumPy array
print("1D Array:", array_1d)  # Printing the 1D array


1D Array: [1 2 3 4 5]


In [4]:
# Creating a 2D array (matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Creating a 2-dimensional NumPy array (matrix)
print("2D Array (Matrix):\n", array_2d)  # Printing the 2D array (matrix)


2D Array (Matrix):
 [[1 2 3]
 [4 5 6]
 [7 8 9]]


In [5]:
# Creating a 3D array (2 matrices of 3x3)
array_3d = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]], 
                     [[10, 11, 12], [13, 14, 15], [16, 17, 18]]])  # Creating a 3-dimensional NumPy array (2 matrices of 3x3)
print("3D Array:\n", array_3d)  # Printing the 3D array


3D Array:
 [[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]]


In [6]:
# Mathematical Operations:

# Sum of all elements
sum_array = np.sum(array_1d)  # Calculating the sum of all elements in the 1D array
print("Sum of 1D Array:", sum_array)  # Printing the sum of the 1D array

# Mean of all elements
mean_array = np.mean(array_1d)  # Calculating the mean of all elements in the 1D array
print("Mean of 1D Array:", mean_array)  # Printing the mean of the 1D array

# Squaring each element
squared_array = np.square(array_1d)  # Squaring each element in the 1D array
print("Squared 1D Array:", squared_array)  # Printing the squared 1D array


Sum of 1D Array: 15
Mean of 1D Array: 3.0
Squared 1D Array: [ 1  4  9 16 25]


In [7]:
array_1d.shape  # Checking the shape of the 1D array: 5 elements along a single axis

(5,)

In [8]:
# Reshaping a 1D array to a 2D array
reshaped_array = array_1d.reshape((5, 1))  # Reshaping the 1D array to a 2D array with 5 rows and 1 column
print("Reshaped Array (1D to 2D):\n", reshaped_array)  # Printing the reshaped 2D array


Reshaped Array (1D to 2D):
 [[1]
 [2]
 [3]
 [4]
 [5]]
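When reshaping, one dimension can be given as `-1` so that NumPy works it out from the number of elements. A small sketch with a hypothetical six-element array:

```python
import numpy as np

values = np.arange(1, 7)              # [1 2 3 4 5 6], shape (6,)

as_column = values.reshape(-1, 1)     # 6 rows, 1 column
as_grid = values.reshape(2, -1)       # 2 rows, column count inferred as 3
flat_again = as_grid.reshape(-1)      # back to a 1D array

print(as_column.shape)
print(as_grid.shape)
print(flat_again.shape)
```

The column form `(n, 1)` comes up often because many ML libraries expect features as a 2D array even when there is only one feature.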


In [9]:
# Transposing a 2D array (matrix)
transposed_matrix = np.transpose(array_2d)  # Transposing the 2D array (matrix)
print("2D Array (Matrix):\n", array_2d)  # Printing the original 2D array (matrix)
print("Transposed Matrix:\n", transposed_matrix)  # Printing the transposed matrix



2D Array (Matrix):
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Transposed Matrix:
 [[1 4 7]
 [2 5 8]
 [3 6 9]]


In [11]:
# Matrix multiplication: each entry of the product is the dot product of a row of matrix_a with a column of matrix_b
matrix_a = np.array([[1, 2], [3, 4]])  # Creating the first matrix
matrix_b = np.array([[5, 6], [7, 8]])  # Creating the second matrix
print("matrix_a:\n", matrix_a) 
print("matrix_b:\n", matrix_b) 
matrix_product = np.dot(matrix_a, matrix_b)  # Performing matrix multiplication
print("Matrix Product:\n", matrix_product)  # Printing the product of the two matrices


matrix_a:
 [[1 2]
 [3 4]]
matrix_b:
 [[5 6]
 [7 8]]
Matrix Product:
 [[19 22]
 [43 50]]
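To see where the numbers in the product come from, the sketch below recomputes each entry by hand and compares it with the `@` operator, which performs the same matrix multiplication as `np.dot` for 2D arrays. It also shows that `*` is element-wise and gives a different result.

```python
import numpy as np

matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

# Each entry is (row of matrix_a) dot (column of matrix_b).
by_hand = np.array([
    [1 * 5 + 2 * 7, 1 * 6 + 2 * 8],   # 19, 22
    [3 * 5 + 4 * 7, 3 * 6 + 4 * 8],   # 43, 50
])

matrix_product = matrix_a @ matrix_b      # matrix multiplication
elementwise = matrix_a * matrix_b         # multiplies matching positions only

print(np.array_equal(by_hand, matrix_product))
print(elementwise)
```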


In [16]:
# Statistical Operations:

# Standard deviation measures how spread out the numbers in a dataset are from the mean.
# np.std computes the population standard deviation by default (ddof=0), i.e. it divides by n rather than n - 1.
std_dev = np.std(array_1d)  # Calculating the standard deviation of the 1D array
print("Standard Deviation of 1D Array:", std_dev)  # Printing the standard deviation of the 1D array

# Generating random numbers (useful for initializing weights in ML models)
random_numbers = np.random.randn(3, 3)  # Generating a 3x3 array of random numbers from a standard normal distribution
print("Random Numbers:\n", random_numbers)  # Printing the array of random numbers


Standard Deviation of 1D Array: 1.4142135623730951
Random Numbers:
 [[ 0.95057761 -0.19118869  1.14322294]
 [ 2.11721334  1.1969978  -0.05391155]
 [-1.27602653  0.40478584  1.54229   ]]
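Random values change on every run, which makes results hard to reproduce. One common way to get repeatable numbers is a seeded generator. The sketch below uses `np.random.default_rng`; the seed value 0 is an arbitrary choice for this example.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

weights_a = rng_a.standard_normal((3, 3))  # 3x3 values from a standard normal distribution
weights_b = rng_b.standard_normal((3, 3))

print(np.array_equal(weights_a, weights_b))  # True: both runs match
```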


In [18]:
# Creating a random dataset
dataset = np.random.rand(10, 3)  # Generating a dataset with 10 samples and 3 features each, with values between 0 and 1
print("Dataset:\n", dataset)  # Printing the generated dataset


Dataset:
 [[0.75875293 0.63515349 0.28664695]
 [0.34719639 0.46884537 0.07145926]
 [0.04632218 0.71594408 0.75705402]
 [0.47079323 0.49047363 0.84481274]
 [0.62584534 0.45493254 0.49626731]
 [0.82668808 0.68637581 0.82806318]
 [0.90584383 0.77015677 0.66231961]
 [0.1808988  0.7721702  0.0698566 ]
 [0.99134241 0.56517309 0.66407787]
 [0.83674407 0.75496424 0.32287998]]


In [21]:
dataset[0]  # Selecting the first row (one sample with all 3 features)

array([0.75875293, 0.63515349, 0.28664695])

In [24]:
dataset[:, 0]  # Selecting the first column (one feature across all 10 samples)

array([0.75875293, 0.34719639, 0.04632218, 0.47079323, 0.62584534,
       0.82668808, 0.90584383, 0.1808988 , 0.99134241, 0.83674407])

In [24]:
# Selecting a specific column (feature) by slicing

feature_column = dataset[:, 1]  # Selecting all rows from the second column
print("Selected Feature Column:\n", feature_column)  # Printing the selected feature column


Selected Feature Column:
 [0.68597738 0.04809113 0.34103575 0.56906574 0.34726844 0.94795074
 0.54493997 0.35254728 0.40858989 0.2502299 ]
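Because the dataset above is random, the indexing patterns are easier to follow on an array with predictable values. The sketch below uses a hypothetical 4x3 array named `table` filled with the numbers 0 to 11.

```python
import numpy as np

table = np.arange(12).reshape(4, 3)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

first_row = table[0]            # one row: all columns of row 0
second_column = table[:, 1]     # one column: all rows of column 1
top_right_block = table[:2, 1:] # first two rows, last two columns

print(first_row)
print(second_column)
print(top_right_block)
```

The general pattern is `array[rows, columns]`, where `:` on its own means "everything along this axis".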


In [25]:
# Splitting data into training and test sets (80-20 split)
train_size = int(0.8 * dataset.shape[0])  # Calculating the training set size (80% of the total dataset)
train_set, test_set = dataset[:train_size], dataset[train_size:]  # Splitting the dataset into training and test sets
print("Training Set:\n", train_set)  # Printing the training set
print("Test Set:\n", test_set)  # Printing the test set


Training Set:
 [[0.84053851 0.68597738 0.0982776 ]
 [0.43414145 0.04809113 0.72197675]
 [0.12867571 0.34103575 0.75944955]
 [0.58576175 0.56906574 0.2365517 ]
 [0.49058705 0.34726844 0.52839662]
 [0.70559326 0.94795074 0.0333403 ]
 [0.45993356 0.54493997 0.46841854]
 [0.12665715 0.35254728 0.3953318 ]]
Test Set:
 [[0.70688435 0.40858989 0.44768117]
 [0.29843744 0.2502299  0.88463554]]
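The split above takes the first 80% of rows as they are, which can be a problem when the rows are stored in a meaningful order. A common variation is to shuffle the row indices first. The sketch below shows one way to do that with NumPy alone; the seed 42 and the name `samples` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed so the split is repeatable
samples = rng.random((10, 3))          # 10 samples, 3 features, values in [0, 1)

# Shuffle the row indices, then cut them at the 80% mark.
indices = rng.permutation(samples.shape[0])
train_size = int(0.8 * samples.shape[0])
train_idx, test_idx = indices[:train_size], indices[train_size:]

train_set = samples[train_idx]
test_set = samples[test_idx]

print(train_set.shape)   # (8, 3)
print(test_set.shape)    # (2, 3)
```

Every row ends up in exactly one of the two sets, only the order in which rows are assigned has changed.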


The website below is an excellent resource for learning Python, and I highly recommend exploring it to build your foundation. It offers tutorials and quizzes that cover the basics.

https://www.w3schools.com/python/
