1) Data Wrangling, I
Perform the following operations using Python on any open-source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using pandas isnull(), and use the describe()
function to get some initial statistics. Provide variable descriptions, types of variables, etc.
Check the dimensions of the DataFrame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python. 

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing

In [4]:
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
iris = pd.read_csv(csv_url, names=col_names)

In [10]:
# Display first 5 rows
print("First 5 Rows (head()):\n", iris.head())

First 5 Rows (head()):
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width      Species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [11]:
# 3.1 Check Missing Values
print("\nCheck Missing Values (isnull()):\n", iris.isnull())


Check Missing Values (isnull()):
      Sepal_Length  Sepal_Width  Petal_Length  Petal_Width  Species
0           False        False         False        False    False
1           False        False         False        False    False
2           False        False         False        False    False
3           False        False         False        False    False
4           False        False         False        False    False
..            ...          ...           ...          ...      ...
145         False        False         False        False    False
146         False        False         False        False    False
147         False        False         False        False    False
148         False        False         False        False    False
149         False        False         False        False    False

[150 rows x 5 columns]


In [12]:
print("\nAny Missing Values in Columns (isnull().any()):\n", iris.isnull().any())



Any Missing Values in Columns (isnull().any()):
 Sepal_Length    False
Sepal_Width     False
Petal_Length    False
Petal_Width     False
Species         False
dtype: bool


In [13]:
print("\nTotal Missing Values (isnull().sum().sum()):\n", iris.isnull().sum().sum())



Total Missing Values (isnull().sum().sum()):
 0


In [14]:
# 3.2 Describe the Dataset
print("\nDescribe the Dataset:\n", iris.describe())


Describe the Dataset:
        Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [15]:
# 3.3 Dataset Index
print("\nDataset Index:\n", iris.index)


Dataset Index:
 RangeIndex(start=0, stop=150, step=1)


In [16]:

# 3.4 Dataset Columns
print("\nDataset Columns:\n", iris.columns)


Dataset Columns:
 Index(['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
       'Species'],
      dtype='object')


In [17]:
# 3.5 Dataset Shape
print("\nDataset Shape (Rows, Columns):\n", iris.shape)


Dataset Shape (Rows, Columns):
 (150, 5)


In [18]:
# 3.6 Dataset Data Types
print("\nDataset Data Types (dtypes):\n", iris.dtypes)


Dataset Data Types (dtypes):
 Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object


In [19]:
# 3.7 Read Data Column-wise
print("\nSepal_Length Column:\n", iris['Sepal_Length'])


Sepal_Length Column:
 0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal_Length, Length: 150, dtype: float64


In [20]:
# 3.8 Read a specific row by position with iloc (iloc[5] is the sixth row, since positions start at 0)
print("\nData at 5th Index (iloc[5]):\n", iris.iloc[5])


Data at 5th Index (iloc[5]):
 Sepal_Length            5.4
Sepal_Width             3.9
Petal_Length            1.7
Petal_Width             0.4
Species         Iris-setosa
Name: 5, dtype: object


In [21]:
# 3.9 Read Rows 0 to 2 (Slicing)
print("\nRows 0 to 2 (iris[0:3]):\n", iris[0:3])


Rows 0 to 2 (iris[0:3]):
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width      Species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa


In [22]:
# 3.10 Select specific columns using loc
print("\nSelect Sepal_Length and Sepal_Width Columns:\n", iris.loc[:, ["Sepal_Length", "Sepal_Width"]])


Select Sepal_Length and Sepal_Width Columns:
      Sepal_Length  Sepal_Width
0             5.1          3.5
1             4.9          3.0
2             4.7          3.2
3             4.6          3.1
4             5.0          3.6
..            ...          ...
145           6.7          3.0
146           6.3          2.5
147           6.5          3.0
148           6.2          3.4
149           5.9          3.0

[150 rows x 2 columns]


In [23]:
# 3.11 Subset first 5 rows (iloc)
print("\nFirst 5 Rows Subset (iloc[:5, :]):\n", iris.iloc[:5, :])


First 5 Rows Subset (iloc[:5, :]):
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width      Species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [24]:
# 3.12 Subset first 3 columns (iloc)
print("\nFirst 3 Columns Subset (iloc[:, :3]):\n", iris.iloc[:, :3])


First 3 Columns Subset (iloc[:, :3]):
      Sepal_Length  Sepal_Width  Petal_Length
0             5.1          3.5           1.4
1             4.9          3.0           1.4
2             4.7          3.2           1.3
3             4.6          3.1           1.5
4             5.0          3.6           1.4
..            ...          ...           ...
145           6.7          3.0           5.2
146           6.3          2.5           5.0
147           6.5          3.0           5.2
148           6.2          3.4           5.4
149           5.9          3.0           5.1

[150 rows x 3 columns]


In [25]:
# 3.13 Subset first 5 rows and first 3 columns (iloc)
print("\nSubset first 5 Rows and 3 Columns (iloc[:5, :3]):\n", iris.iloc[:5, :3])


Subset first 5 Rows and 3 Columns (iloc[:5, :3]):
    Sepal_Length  Sepal_Width  Petal_Length
0           5.1          3.5           1.4
1           4.9          3.0           1.4
2           4.7          3.2           1.3
3           4.6          3.1           1.5
4           5.0          3.6           1.4


In [26]:
# -----------------------------------------------
# Step 4: Data Formatting and Normalization
# -----------------------------------------------

# 4.1 Check Data Types Again
print("\nData Types Before Formatting:\n", iris.dtypes)


Data Types Before Formatting:
 Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object


In [28]:
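# 4.2 Convert data types if needed (illustration only: the columns are already float64).
# Caution: astype(int) truncates the decimals (5.1 -> 5), which changes the Sepal_Length values normalized below.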
iris['Sepal_Length'] = iris['Sepal_Length'].astype(int)



In [29]:
# 4.3 Normalize Numeric Data (Min-Max Scaling)
numeric_columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']

In [30]:
# Create Scaler
min_max_scaler = preprocessing.MinMaxScaler()

In [31]:
# Apply Normalization
iris[numeric_columns] = min_max_scaler.fit_transform(iris[numeric_columns])

In [32]:
# View Normalized Data
print("\nNormalized Data (0-1 Range):\n", iris.head())


Normalized Data (0-1 Range):
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width      Species
0      0.333333     0.625000      0.067797     0.041667  Iris-setosa
1      0.000000     0.416667      0.067797     0.041667  Iris-setosa
2      0.000000     0.500000      0.050847     0.041667  Iris-setosa
3      0.000000     0.458333      0.084746     0.041667  Iris-setosa
4      0.333333     0.666667      0.067797     0.041667  Iris-setosa


In [33]:
# -----------------------------------------------
# Step 5: Convert Categorical Variables to Quantitative
# -----------------------------------------------

# 5.1 Label Encode the 'Species' column
label_encoder = preprocessing.LabelEncoder()
iris['Species'] = label_encoder.fit_transform(iris['Species'])

In [34]:
# View After Label Encoding
print("\nAfter Label Encoding Species:\n", iris.head())


After Label Encoding Species:
    Sepal_Length  Sepal_Width  Petal_Length  Petal_Width  Species
0      0.333333     0.625000      0.067797     0.041667        0
1      0.000000     0.416667      0.067797     0.041667        0
2      0.000000     0.500000      0.050847     0.041667        0
3      0.000000     0.458333      0.084746     0.041667        0
4      0.333333     0.666667      0.067797     0.041667        0


In [35]:
# Check Encoded Values
print("\nUnique Encoded Species Labels:\n", iris['Species'].unique())


Unique Encoded Species Labels:
 [0 1 2]


📖 Theory: Functions Used in Data Wrangling Practical
1. pd.read_csv()
Purpose and Explanation:
pd.read_csv() is the pandas function that loads a CSV (Comma-Separated Values) file into a DataFrame. It parses the file, splits each line into fields, and arranges the data in tabular form. In our practical, we used pd.read_csv() to load the Iris dataset from the UCI Machine Learning Repository. Because the file has no header row, we supplied the column names through the names parameter.

Example:
We loaded the data using:

iris = pd.read_csv(csv_url, names=col_names)
Importance:
Loading the data correctly into a DataFrame is the first step of the analysis, because every later operation depends on it.
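
The effect of the names parameter can be sketched without a network connection. The example below is a minimal illustration, not part of the practical: it reads three rows (copied from the head() output above) from an in-memory string, and the names sample_csv, df and df_wrong are hypothetical.

```python
import io

import pandas as pd

# Three records copied from the head() output; the data has no header row.
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

# With names=, pandas uses our labels and treats every line as data.
df = pd.read_csv(io.StringIO(sample_csv), names=col_names)

# Without names= (or header=None), the first record is consumed as the header.
df_wrong = pd.read_csv(io.StringIO(sample_csv))

print(df.shape)        # (3, 5)
print(df_wrong.shape)  # (2, 5)
```

Comparing the two shapes shows why the column names had to be supplied: otherwise one observation would silently be lost to the header.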

2. head()
Purpose and Explanation:
The head() function displays the first few records (default 5 rows) of the DataFrame. It helps in quick verification of whether the data has been loaded properly and the columns have been correctly assigned.

Example:

iris.head()
Importance:
This function is helpful for initial exploratory data analysis to observe the structure of the dataset.

3. isnull()
Purpose and Explanation:
The isnull() function is used to identify missing values in a DataFrame. It returns a Boolean DataFrame indicating True where the data is missing.

Example:

iris.isnull()
Importance:
Detecting missing values early is important because they can affect model performance and lead to incorrect analysis if not handled properly.

4. isnull().any()
Purpose and Explanation:
isnull().any() checks each column individually and returns True if any missing values are present in that column. It is a quicker way to detect missing data across columns.

Example:

iris.isnull().any()
Importance:
This is useful for summarizing which columns need cleaning before proceeding further.

5. isnull().sum().sum()
Purpose and Explanation:
isnull().sum().sum() computes the total number of missing values across the entire DataFrame. First, it sums missing values per column and then aggregates them.

Example:

iris.isnull().sum().sum()
Importance:
It gives a complete picture of how much data is missing in the dataset.
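
Because the Iris data has no missing values, every check in the practical returns False or 0. The sketch below uses a small hypothetical frame (demo) with two deliberately missing entries so that the three isnull() variants produce visible results.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row frame with one missing number and one missing label.
demo = pd.DataFrame({
    'Sepal_Length': [5.1, np.nan, 4.7],
    'Species': ['Iris-setosa', 'Iris-setosa', None],
})

has_missing = demo.isnull().any()    # True/False per column
per_column = demo.isnull().sum()     # number of missing values per column
total = demo.isnull().sum().sum()    # number of missing values in the whole frame

print(has_missing)
print(per_column)
print(total)  # 2
```

The per-column count (isnull().sum()) is often the most useful of the three, because it shows where the cleaning effort is needed.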

6. describe()
Purpose and Explanation:
The describe() function reports descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns; non-numeric columns such as Species are left out by default. It helps to quickly understand the distribution and spread of the data.

Example:

iris.describe()
Importance:
Descriptive statistics are important for understanding data variability and detecting outliers.
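
The describe() output in the practical has no Species column. As a minimal sketch (the frame demo is hypothetical, with values taken from the first five rows shown earlier), calling describe() on the text column itself gives a different kind of summary:

```python
import pandas as pd

# First five Petal_Length values and species labels from the head() output.
demo = pd.DataFrame({
    'Petal_Length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'Species': ['Iris-setosa'] * 5,
})

numeric_summary = demo.describe()             # numeric columns only
species_summary = demo['Species'].describe()  # count, unique, top, freq

print(numeric_summary)
print(species_summary)
```

For a text column the summary lists the number of values, the number of distinct values, the most frequent value, and its frequency.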

7. index
Purpose and Explanation:
The index attribute holds the row labels of the DataFrame. For the Iris data it is RangeIndex(start=0, stop=150, step=1), meaning the 150 rows are labelled 0 through 149.

Example:

iris.index
Importance:
Understanding indexing is crucial for operations like slicing and selecting data.

8. columns
Purpose and Explanation:
The columns attribute lists all the column names of the DataFrame. It helps to verify if the data has correct feature labels.

Example:

iris.columns
Importance:
Knowing column names is important for referencing, selecting, and analyzing specific features.

9. shape
Purpose and Explanation:
The shape attribute returns a tuple representing the dimensions of the DataFrame as (rows, columns).

Example:

iris.shape
Importance:
It is important for understanding the size of the dataset and planning preprocessing steps accordingly.

10. dtypes
Purpose and Explanation:
The dtypes attribute displays the datatype of each column, such as float64, int64, or object. Knowing the datatype is crucial for choosing the right data transformations.

Example:

iris.dtypes
Importance:
It helps ensure that the data types align with the kind of operations we plan to perform.
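
The data types can also drive column selection. The practical lists the numeric columns by hand; one possible alternative, sketched here on a hypothetical two-row frame (demo), is to let select_dtypes() pick them:

```python
import pandas as pd

demo = pd.DataFrame({
    'Sepal_Length': [5.1, 4.9],
    'Sepal_Width': [3.5, 3.0],
    'Species': ['Iris-setosa', 'Iris-setosa'],
})

# Columns whose dtype is numeric (int or float).
numeric_columns = demo.select_dtypes(include='number').columns.tolist()
# Everything else, here only the text column.
other_columns = demo.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)  # ['Sepal_Length', 'Sepal_Width']
print(other_columns)    # ['Species']
```

This avoids retyping column names when the dataset has many features, at the cost of being less explicit about which columns are scaled.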

11. iloc[]
Purpose and Explanation:
iloc[] stands for integer-location based indexing. It allows selection by row and column index positions, making it useful for slicing the DataFrame based on numerical indices.

Example:

iris.iloc[5]
Importance:
Essential for selecting data when the exact row/column labels are not known.

12. loc[]
Purpose and Explanation:
loc[] performs label-based indexing: it selects rows and columns by their labels (row index labels and column names). Unlike iloc[], a loc[] slice includes its end label.

Example:

iris.loc[:, ['Sepal_Length', 'Sepal_Width']]
Importance:
Useful when we want to select data based on actual labels rather than numeric indices.
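
In the Iris DataFrame the row labels happen to equal the row positions, so iloc[] and loc[] look interchangeable. The sketch below uses a hypothetical frame (demo) whose labels start at 10 to make the difference visible.

```python
import pandas as pd

# Hypothetical frame whose row labels (10..13) differ from positions (0..3).
demo = pd.DataFrame(
    {'Sepal_Length': [5.1, 4.9, 4.7, 4.6]},
    index=[10, 11, 12, 13],
)

by_position = demo.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = demo.loc[10:12]    # labels 10, 11 and 12; the end label is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```

The same distinction applies after filtering or sorting a DataFrame, when the original labels no longer match the row order.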

13. astype()
Purpose and Explanation:
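The astype() function changes the data type of a column, for example from float64 to an integer type. Casting floats to integers truncates the fractional part (5.1 becomes 5), so the conversion in the practical altered the Sepal_Length values before normalization.

Example:

iris['Sepal_Length'] = iris['Sepal_Length'].astype(int)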
Importance:
Helps correct datatype issues that could affect data analysis and model training.
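
A type conversion can also lose information, which is worth checking before overwriting a column. The following minimal sketch (the Series names are hypothetical; the values are the first five Sepal_Length readings shown earlier) contrasts a lossy cast with a lossless one:

```python
import pandas as pd

lengths = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name='Sepal_Length')

# Lossy: float -> int drops the fractional part.
as_int = lengths.astype(int)
print(as_int.tolist())   # [5, 4, 4, 4, 5]

# Lossless: text labels -> pandas 'category' dtype keeps every value.
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])
as_category = species.astype('category')
print(as_category.dtype)  # category
```

Five distinct measurements collapse to two integers, which explains why a cast like this should stay a demonstration rather than a preprocessing step for this dataset.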

14. preprocessing.MinMaxScaler()
Purpose and Explanation:
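MinMaxScaler() from scikit-learn rescales each feature to a fixed range, by default 0 to 1, by subtracting the column minimum and dividing by the column range (maximum minus minimum). This matters for algorithms that are sensitive to the scale of the data, such as distance-based or gradient-based methods.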

Example:

min_max_scaler = preprocessing.MinMaxScaler()
iris[numeric_columns] = min_max_scaler.fit_transform(iris[numeric_columns])
Importance:
Putting all features on the same scale keeps a feature with larger magnitudes from dominating the others and can speed up the convergence of gradient-based models.
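
With the default range, the scaling is the arithmetic x' = (x - min) / (max - min) applied per column. The sketch below checks this on a hypothetical one-column frame (demo) built from the Sepal_Length minimum (4.3), one sample value (5.1) and the maximum (7.9) reported by describe().

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Minimum, one sample value, and maximum of Sepal_Length from describe().
demo = pd.DataFrame({'Sepal_Length': [4.3, 5.1, 7.9]})

scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(demo[['Sepal_Length']])  # 2-D array, one column

# The same computation written out by hand.
col = demo['Sepal_Length']
manual = (col - col.min()) / (col.max() - col.min())

print(np.allclose(scaled.ravel(), manual.to_numpy()))  # True

# The fitted scaler remembers min and max, so the scaling can be undone.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored.ravel(), col.to_numpy()))   # True
```

Keeping the fitted scaler object around is useful because the same minimum and maximum can then be applied to new data with transform().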

15. LabelEncoder()
Purpose and Explanation:
LabelEncoder() transforms categorical values (strings) into integer codes from 0 to n_classes - 1, assigned in sorted order of the class names. This is needed because most machine learning estimators expect numerical input.

Example:

label_encoder = preprocessing.LabelEncoder()
iris['Species'] = label_encoder.fit_transform(iris['Species'])
Importance:
Converting string labels to numerical form enables the use of categorical data in supervised learning models.
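
After encoding, the output only shows 0, 1 and 2, so it helps to recover which species each code stands for. The sketch below is a minimal illustration on a hypothetical four-element Series (species); it also shows pd.get_dummies() as one possible alternative when the integer codes should not suggest an order among the categories.

```python
import pandas as pd
from sklearn import preprocessing

species = pd.Series(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

label_encoder = preprocessing.LabelEncoder()
codes = label_encoder.fit_transform(species)

# classes_ is sorted; the position of each name is its integer code.
mapping = {str(name): code for code, name in enumerate(label_encoder.classes_)}
print(mapping)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# The codes can be turned back into the original strings.
decoded = label_encoder.inverse_transform(codes)

# Alternative: one indicator column per category instead of a single code.
one_hot = pd.get_dummies(species, prefix='Species')
print(one_hot.shape)  # (4, 3)
```

Integer codes are compact and suit a target column such as Species; one-hot columns are usually the safer choice for a categorical input feature, since the categories have no natural ordering.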

