# Task 1: Data Handling with NumPy & Pandas

# 1. Introduction

## Problem Statement

The objective of this task is to understand the fundamentals of data handling by loading, inspecting, cleaning, and analyzing a structured dataset using pandas and NumPy. Data preprocessing is an essential step in data science to ensure accuracy and consistency before performing analysis or building models.

# 2. Dataset Description

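The Iris dataset is used for this task. It contains four numerical measurements of iris flowers (sepal length, sepal width, petal length, and petal width, all in centimetres) along with an Id column and the species label. The copy used here, Iris_missingdata.csv, has 150 rows and some missing measurement values, which makes it suitable for demonstrating data inspection and cleaning techniques.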

# 3. Import Required Libraries

In [10]:
# Import pandas for data manipulation and analysis
import pandas as pd

# Import NumPy for numerical computations
import numpy as np


# 4. Load the Dataset

In [20]:
# Load the CSV dataset into a pandas DataFrame
df = pd.read_csv("/Iris_missingdata.csv")
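
To illustrate what `pd.read_csv` does with missing fields, here is a minimal sketch. It assumes a hypothetical three-row excerpt in the same column layout as the Iris file. The values and the `csv_text` and `sample` names are for illustration only and are not read from the actual file.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same column layout as the Iris file;
# the empty field in the second row stands in for a missing measurement.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
"""

# read_csv accepts any file-like object, so StringIO replaces the file path here
sample = pd.read_csv(io.StringIO(csv_text))

# The empty field is parsed as NaN and the column keeps a float dtype
print(sample["SepalLengthCm"].isna().sum())
print(sample["SepalLengthCm"].dtype)
```

Because empty fields become NaN automatically, no extra arguments are needed for this kind of gap. Other placeholders (for example a custom marker string) would have to be passed through the `na_values` parameter.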


# 5. Dataset Inspection

In [23]:
# Display the number of rows and columns
df.shape

(150, 6)

In [24]:
# Display column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  139 non-null    float64
 2   SepalWidthCm   143 non-null    float64
 3   PetalLengthCm  142 non-null    float64
 4   PetalWidthCm   141 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [25]:
# Display the column names
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [26]:
# Display the first five rows of the dataset
df.head()


   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            NaN           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


# 6. Data Cleaning
## 6.1 Handling Missing Values

In [27]:
# Check for missing values in each column
df.isnull().sum()

Id                0
SepalLengthCm    11
SepalWidthCm      7
PetalLengthCm     8
PetalWidthCm      9
Species           0
dtype: int64
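
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both, using a hypothetical toy frame (`toy`) with made-up values rather than the Iris file itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the Iris file
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, np.nan],
    "SepalWidthCm": [3.5, 3.0, np.nan, 3.1],
    "Species": ["a", "a", "b", "b"],
})

# Number of NaN cells per column
missing_count = toy.isnull().sum()

# The mean of a boolean mask is the fraction of True values, so this is a percentage
missing_share = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

Applied to the real `df`, the same two lines would express the 11, 7, 8, and 9 missing values as a percentage of the 150 rows.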


In [28]:
# Fill missing numerical values with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
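
The following sketch shows how mean imputation behaves on a small example. The `toy` frame and its values are hypothetical. The sketch also returns a new frame instead of using `inplace=True`, which is a stylistic choice and not a correction of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each numeric column
toy = pd.DataFrame({
    "PetalLengthCm": [1.0, np.nan, 3.0],
    "PetalWidthCm": [0.5, 1.5, np.nan],
    "Species": ["x", "y", "z"],
})

# Column means are computed from the non-missing values only (NaN is skipped)
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its own label;
# columns without an entry in col_means (here Species) are left untouched
filled = toy.fillna(col_means)
print(filled)
```

Mean imputation leaves each column's mean unchanged but reduces its spread, since every filled value sits exactly at the centre. When that matters, the median or a per-species mean are common alternatives.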

In [34]:
# Verify that no missing values remain after filling
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


## 6.2 Removing Duplicate Records

In [29]:
# Check the number of duplicate rows
df.duplicated().sum()

np.int64(0)

In [30]:
# Remove duplicate rows from the dataset
df.drop_duplicates(inplace=True)
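
`duplicated()` compares entire rows, so a column of unique identifiers such as Id means no two rows can ever match. The sketch below uses a hypothetical toy frame to show how the `subset` parameter restricts the comparison to the remaining columns. Whether repeated measurements should count as duplicates is a judgment call, since two different flowers can share the same measurements.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 differ only in Id
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# With Id included, every row is unique
all_cols_dupes = toy.duplicated().sum()

# Ignoring Id, row 1 repeats row 0
feature_cols = [c for c in toy.columns if c != "Id"]
feature_dupes = toy.duplicated(subset=feature_cols).sum()

# keep="first" retains the first occurrence of each repeated row
deduped = toy.drop_duplicates(subset=feature_cols, keep="first")
print(all_cols_dupes, feature_dupes, len(deduped))
```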

# 7. NumPy Array Operations and Statistics

In [31]:
# Select only numerical columns and convert them to a NumPy array
# (this selection includes the Id column as well as the four measurements)
numeric_array = df.select_dtypes(include=np.number).values

# Calculate the mean over all values in the array (all numeric columns pooled)
mean_value = np.mean(numeric_array)

# Calculate the median over all values in the array
median_value = np.median(numeric_array)

# Calculate the population standard deviation (ddof=0) over all values
std_deviation = np.std(numeric_array)

# Display the calculated statistics
mean_value, median_value, std_deviation


(np.float64(17.874187903895844),
 np.float64(4.15),
 np.float64(34.75975344367721))
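
The statistics above pool every numeric value into one array, including the Id column, which explains the large mean and standard deviation relative to the median. A per-column summary is usually easier to interpret. The sketch below shows one way to compute it with `axis=0` on a hypothetical toy frame. It drops Id on the assumption that the identifier carries no measurement information.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the Iris file
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.0, 5.0, 6.0, 6.0],
    "PetalLengthCm": [1.0, 2.0, 3.0, 4.0],
})

# Drop the identifier so only measurements remain
features = toy.drop(columns=["Id"])
arr = features.to_numpy()

# axis=0 aggregates down the rows, giving one statistic per column
col_mean = np.mean(arr, axis=0)
col_median = np.median(arr, axis=0)
col_std = np.std(arr, axis=0)               # population std (ddof=0), NumPy's default
sample_std = np.std(arr, axis=0, ddof=1)    # sample std, matches pandas' DataFrame.std()

for name, m, med, s in zip(features.columns, col_mean, col_median, col_std):
    print(f"{name}: mean={m:.2f}, median={med:.2f}, std={s:.2f}")
```

Note that `np.std` defaults to `ddof=0`, while pandas' `std()` defaults to `ddof=1`, so the two libraries give slightly different values on the same data unless `ddof` is set explicitly.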

# 8. Conclusion

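In this task, the dataset was loaded and inspected using pandas. Missing values in the four measurement columns were filled with the corresponding column means, and a duplicate check found no repeated rows. NumPy was then used to compute the mean, median, and standard deviation of the numeric values. Because these statistics pool every numeric column, including Id, they summarize the array as a whole rather than any single measurement. Loading, inspecting, cleaning, and summarizing data in this way is the usual first stage of data analysis and machine learning workflows.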