1. Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset from the web (e.g., https://www.kaggle.com). Provide a clear
 description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: Check for missing values in the data using pandas isnull(), and use the describe()
function to get some initial statistics. Provide variable descriptions, types of variables, etc.
Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you do in the above steps and
explain everything that you do to import/read/scrape the data set.

In [2]:
# 1. Import all the required Python Libraries
import pandas as pd
import numpy as np

# 2. Load the dataset into pandas dataframe
# Replace with your local path or use online URL if you have one
df = pd.read_csv(r"E:\DSBDAL\1\test.csv")

In [3]:
# 3. Initial View of the Dataset
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  


In [4]:
# 4. Data Preprocessing
print("\nChecking for missing values:")
print(df.isnull().sum())

print("\nDataset Description:")
print(df.describe())

print("\nVariable Types:")
print(df.dtypes)

print("\nShape of the DataFrame:")
print(df.shape)


Checking for missing values:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Dataset Description:
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

Variable Types

In [7]:
# 5. Data Formatting and Normalization
# Convert object types if needed
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


# Confirm again
print("\nMissing values after filling:")
print(df.isnull().sum())


Missing values after filling:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
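
The count above still reports 1 missing `Fare` value and 327 missing `Cabin` values. One possible way to deal with them is sketched below on a small hypothetical frame. The median fill for `Fare`, the `HasCabin` indicator column, and the sample values are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of gaps as the real columns
demo = pd.DataFrame({
    "Fare": [7.8292, 7.0, np.nan, 8.6625],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# A single missing fare can be filled with the median, which is robust to outliers
demo["Fare"] = demo["Fare"].fillna(demo["Fare"].median())

# Cabin is mostly empty, so keep only a 0/1 flag for "cabin recorded" and drop the text
demo["HasCabin"] = demo["Cabin"].notna().astype(int)
demo = demo.drop(columns="Cabin")

print(demo.isnull().sum())
```

Whether to impute, flag, or drop a sparse column such as `Cabin` depends on the later analysis, so this is only one reasonable option.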


In [8]:
# 6. Turning categorical variables into quantitative variables
# Use one-hot encoding
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# Final DataFrame overview
print("\nData types after conversion:")
print(df.dtypes)

print("\nFinal DataFrame (first 5 rows):")
print(df.head())


Data types after conversion:
PassengerId      int64
Pclass           int64
Name            object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Sex_male          bool
Embarked_Q        bool
Embarked_S        bool
dtype: object

Final DataFrame (first 5 rows):
   PassengerId  Pclass                                          Name   Age  \
0          892       3                              Kelly, Mr. James  34.5   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  47.0   
2          894       2                     Myles, Mr. Thomas Francis  62.0   
3          895       3                              Wirz, Mr. Albert  27.0   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0   

   SibSp  Parch   Ticket     Fare Cabin  Sex_male  Embarked_Q  Embarked_S  
0      0      0   330911   7.8292   NaN      True        True       False  
1      1      0   363272  

Practical: Data Wrangling - I (Using Python)
Objective:
To perform data wrangling operations on an open-source dataset by importing, cleaning, formatting, and transforming the data using Python libraries such as Pandas and NumPy.
**1. Importing Required Python Libraries:**
```python
import pandas as pd
import numpy as np
```
- **Pandas:** Used for data manipulation and analysis. The `DataFrame` structure it provides makes it easy to handle tabular data.
- **NumPy:** A library for numerical operations and handling arrays efficiently. It is often used for mathematical operations and working with missing values.
**2. Dataset Source and Description:**
- **Dataset Used:** Titanic dataset (commonly used for machine learning and data preprocessing practice).
- **Source:** [Kaggle - Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)
- **Description:** The dataset contains information about passengers on the Titanic, such as age, sex, ticket class, fare, and port of embarkation.
- **File Used:** `test.csv` (418 rows, 11 columns). This file does not include the survival status; the `Survived` column appears only in `train.csv`.
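**3. Loading the Dataset:**
```python
df = pd.read_csv(r"E:\DSBDAL\1\test.csv")
```
- The dataset is read using `pd.read_csv()`, and the contents are loaded into a DataFrame. The `r` prefix marks a raw string, so the backslashes in the Windows path are not treated as escape sequences.
- An initial look at the data is given using:
  ```python
  print(df.head())
  ```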
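
As a minimal sketch of what `pd.read_csv()` does with this file, the example below parses three rows copied from the `df.head()` output instead of reading from disk. The in-memory `SAMPLE_CSV` string and the `dtype={"Ticket": str}` argument are assumptions added for illustration; the original notebook reads the file with default settings.

```python
import io
import pandas as pd

# Three rows taken from the head() output; names contain commas, so they are quoted
SAMPLE_CSV = '''PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
'''

# Reading Ticket as text keeps ticket numbers from being parsed as integers
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={"Ticket": str})

print(sample.shape)
print(sample.dtypes)
```

Empty fields (here every `Cabin` entry) are read as `NaN`, which is why `isnull()` can count them later.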
**4. Data Preprocessing:**
a) Checking for Missing Values:
```python
print(df.isnull().sum())
```
This shows how many null values exist in each column. In this file, `Age` has 86, `Fare` has 1, and `Cabin` has 327.
b) Descriptive Statistics:
```python
print(df.describe())
```
Provides summary statistics (count, mean, standard deviation, min, quartiles, and max) for the numeric columns.
c) Data Types and Dimensions:
```python
print(df.dtypes)
print(df.shape)
```
Helps identify the type of each variable (e.g., `int64`, `object`, `float64`) and the shape of the dataset (rows × columns).
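
The three checks above can also be combined into one overview table. The sketch below does this on a small hypothetical frame; the sample values and the `summary` layout are assumptions for illustration, not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in numeric and text columns
demo = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Fare": [7.8292, 7.0, np.nan, 8.6625],
    "Cabin": [np.nan, np.nan, np.nan, "B45"],
    "Sex": ["male", "female", "male", "male"],
})

# One row per variable: its dtype, missing count, and missing percentage
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "missing": demo.isnull().sum(),
    "missing_pct": (demo.isnull().mean() * 100).round(1),
})

print(summary)
print("Shape:", demo.shape)
```

A percentage column makes it easier to see at a glance which variables, such as `Cabin`, are too sparse to fill with a simple statistic.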
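**5. Data Formatting and Normalization:**
a) Data Type Conversion:
```python
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
```
`Sex` and `Embarked` are loaded as `object` (string) columns but take only a few distinct values. Converting them to `category` stores each distinct value once plus a small integer code per row, which reduces memory usage and makes the categorical nature of the variables explicit.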
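
The memory effect of the conversion can be checked directly. The following is a small sketch on a synthetic column of 1,000 values (an assumption for illustration, not the real `Sex` column):

```python
import pandas as pd

# Synthetic column with two repeated string values
sex = pd.Series(["male", "female"] * 500, dtype="object")
as_cat = sex.astype("category")

# deep=True counts the memory of the Python strings themselves
print("object  :", sex.memory_usage(deep=True), "bytes")
print("category:", as_cat.memory_usage(deep=True), "bytes")

# Categories are stored once; each row holds only an integer code
print(as_cat.cat.categories.tolist())
print(as_cat.cat.codes.head().tolist())
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.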
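b) Handling Missing Values:
```python
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```
- The 86 missing `Age` values are filled with the column’s mean.
- Missing `Embarked` values would be filled with the most frequent value (mode); in this file `Embarked` has no missing values, so this line changes nothing.
- `Fare` (1 missing value) and `Cabin` (327 missing values) are not filled, so they still appear in the missing-value count afterwards.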
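
The section title mentions normalization, but the notebook does not rescale any numeric column. A minimal sketch of two common options, min-max scaling and z-score standardization, is given below. It uses the five `Age` and `Fare` values from the `df.head()` output; the `minmax` and `zscore` names are hypothetical, and applying either method to this dataset is an assumption rather than part of the original practical.

```python
import pandas as pd

# Age and Fare values from the first five rows of the dataset
demo = pd.DataFrame({
    "Age": [34.5, 47.0, 62.0, 27.0, 22.0],
    "Fare": [7.8292, 7.0, 9.6875, 8.6625, 12.2875],
})
cols = ["Age", "Fare"]

# Min-max scaling: maps each column onto the range [0, 1]
minmax = (demo[cols] - demo[cols].min()) / (demo[cols].max() - demo[cols].min())

# Z-score standardization: mean 0 and (sample) standard deviation 1 per column
zscore = (demo[cols] - demo[cols].mean()) / demo[cols].std()

print(minmax)
print(zscore)
```

Missing values should be handled before scaling, because `min()`, `max()`, `mean()`, and `std()` skip `NaN` while the scaled column would still contain it.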
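**6. Encoding Categorical Variables:**

```python
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
```
- Converts categorical variables into numeric form using **one-hot encoding**: each category becomes its own True/False indicator column.
- `drop_first=True` drops the first level of each variable (here `Sex_female` and `Embarked_C`), leaving `Sex_male`, `Embarked_Q`, and `Embarked_S`. The dropped level is implied when the remaining indicators are all False, which avoids perfectly collinear dummy columns.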
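
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical three-row frame; the `dtype=int` argument is an addition that returns 0/1 integers instead of the `bool` columns shown in the notebook output.

```python
import pandas as pd

# Hypothetical rows covering every Embarked level (C, Q, S)
demo = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["Q", "S", "C"],
})

# The first level of each variable (female, C) is dropped and becomes the baseline
encoded = pd.get_dummies(demo, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(encoded)
```

The third row has `Embarked_Q = 0` and `Embarked_S = 0`, which identifies it as the dropped level `C`.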
**Conclusion:**

- Data wrangling involves steps like importing data, identifying missing values, converting data types, and encoding categorical variables.
- After these steps, the dataset is more consistent and closer to being ready for data analysis or machine learning tasks; the remaining missing values in `Fare` and `Cabin` still need to be handled.
**Learning Outcomes:**
- Ability to load and inspect a real-world dataset.
- Familiarity with common preprocessing techniques like handling missing data and encoding.
- Understanding of data types and proper data formatting practices.