# **Marriage Trends in India: Love vs. Arranged** 

#### **A Dataset of Marriages Analyzing Social, Economic, and Cultural Factors**

-----------------------------------------------------------------------------------------------------------

##### **Objective:** Preprocess, clean, and explore the "Marriage Trends in India" dataset, then build visualizations that compare love and arranged marriages, using Python libraries such as Pandas, NumPy, and Matplotlib.

-----------------------------------------------------------------------------------------------------------

#### **Step 1: Import necessary libraries**

In [2]:
import pandas as pd

In [3]:
import numpy as np

#### **Step 2: Load the dataset**

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/0shanx/0shanx/refs/heads/main/marriage_data_india.csv')


#####  Display the first 5 rows of the dataset to get a quick look at the data

In [5]:
print(df.head())

   ID Marriage_Type  Age_at_Marriage  Gender Education_Level Caste_Match  \
0   1          Love               23    Male        Graduate   Different   
1   2          Love               28  Female          School        Same   
2   3      Arranged               39    Male    Postgraduate        Same   
3   4      Arranged               26  Female          School   Different   
4   5          Love               32  Female        Graduate        Same   

  Religion Parental_Approval Urban_Rural Dowry_Exchanged Marital_Satisfaction  \
0    Hindu                No       Urban              No               Medium   
1    Hindu               Yes       Rural             Yes                  Low   
2   Muslim               Yes       Rural              No               Medium   
3    Hindu               Yes       Urban             Yes                  Low   
4    Hindu           Partial       Rural             Yes               Medium   

  Divorce_Status  Children_Count Income_Level  Years_Sin

#### **Step 3: Understand the dataset**

##### Check the columns in the dataset

In [6]:
print(df.columns)

Index(['ID', 'Marriage_Type', 'Age_at_Marriage', 'Gender', 'Education_Level',
       'Caste_Match', 'Religion', 'Parental_Approval', 'Urban_Rural',
       'Dowry_Exchanged', 'Marital_Satisfaction', 'Divorce_Status',
       'Children_Count', 'Income_Level', 'Years_Since_Marriage',
       'Spouse_Working', 'Inter-Caste', 'Inter-Religion'],
      dtype='object')


##### Check the data types of each column

In [7]:
print(df.dtypes)

ID                       int64
Marriage_Type           object
Age_at_Marriage          int64
Gender                  object
Education_Level         object
Caste_Match             object
Religion                object
Parental_Approval       object
Urban_Rural             object
Dowry_Exchanged         object
Marital_Satisfaction    object
Divorce_Status          object
Children_Count           int64
Income_Level            object
Years_Since_Marriage     int64
Spouse_Working          object
Inter-Caste             object
Inter-Religion          object
dtype: object


##### Check for missing values in each column

In [8]:
print(df.isnull().sum())

ID                      0
Marriage_Type           0
Age_at_Marriage         0
Gender                  0
Education_Level         0
Caste_Match             0
Religion                0
Parental_Approval       0
Urban_Rural             0
Dowry_Exchanged         0
Marital_Satisfaction    0
Divorce_Status          0
Children_Count          0
Income_Level            0
Years_Since_Marriage    0
Spouse_Working          0
Inter-Caste             0
Inter-Religion          0
dtype: int64


##### Check the shape of the dataset (number of rows and columns)

In [9]:
print(df.shape)

(10000, 18)


##### Display summary statistics for numerical columns

In [10]:
print(df.describe())

                ID  Age_at_Marriage  Children_Count  Years_Since_Marriage
count  10000.00000     10000.000000    10000.000000          10000.000000
mean    5000.50000        28.503800        2.508800             24.973800
std     2886.89568         6.279564        1.695467             14.054838
min        1.00000        18.000000        0.000000              1.000000
25%     2500.75000        23.000000        1.000000             13.000000
50%     5000.50000        29.000000        3.000000             25.000000
75%     7500.25000        34.000000        4.000000             37.000000
max    10000.00000        39.000000        5.000000             49.000000
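`describe()` summarizes only numeric columns by default, so the 14 `object` columns listed in the dtypes output do not appear in the table above. One way to get a comparable overview of them is sketched below. It runs on a small hand-made frame named `sample` (a hypothetical stand-in whose values are illustrative, not taken from the real file), so the printed numbers say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame that mirrors a few columns of the dataset;
# the values are illustrative, not taken from the real file.
sample = pd.DataFrame({
    "Marriage_Type": ["Love", "Arranged", "Arranged", "Love", "Arranged"],
    "Urban_Rural": ["Urban", "Rural", "Rural", "Urban", "Rural"],
    "Age_at_Marriage": [23, 28, 39, 26, 32],
})

# Select the non-numeric columns, which describe() skips by default.
categorical = sample.select_dtypes(exclude="number")

# Number of distinct categories in each categorical column.
distinct_counts = categorical.nunique()
print(distinct_counts)

# Share of each category in one column (the fractions sum to 1).
type_share = sample["Marriage_Type"].value_counts(normalize=True)
print(type_share)
```

Applied to `df`, the same calls would show how many categories each column has and how balanced they are, which is useful to know before encoding them.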


#### **Step 4: Handle missing values (if any)**

##### **Option 1:** Drop rows with missing values (if the dataset is large enough)
`df = df.dropna()`

##### **Option 2:** Fill missing values with the mean, median, or mode
For numerical columns, fill missing values with the mean:
`df['column_name'] = df['column_name'].fillna(df['column_name'].mean())`

For categorical columns, fill missing values with the mode:
`df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])`

##### **The dataset has no missing values, so this step is skipped.**
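Since the real data has no gaps, the two options above can only be tried on made-up data. The sketch below uses a hypothetical frame named `toy` with deliberately missing entries. It assigns the result of `fillna` back to the column rather than calling it with `inplace=True` on a column selection, because recent pandas versions warn about that pattern and it may not modify `df` at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real dataset has none.
toy = pd.DataFrame({
    "Age_at_Marriage": [23.0, np.nan, 39.0, 26.0],
    "Religion": ["Hindu", "Hindu", None, "Muslim"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill gaps column by column, assigning the result back
# instead of relying on inplace=True on a column selection.
filled = toy.copy()
filled["Age_at_Marriage"] = filled["Age_at_Marriage"].fillna(
    filled["Age_at_Marriage"].mean()
)
filled["Religion"] = filled["Religion"].fillna(filled["Religion"].mode()[0])

print(len(dropped))
print(filled)
```

Dropping keeps only the two complete rows here, while filling keeps all four. Which trade-off is acceptable depends on how much data is missing and on whether the mean or mode is a reasonable stand-in for the column.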

#### **Step 5: Remove duplicate rows (if any)**

In [11]:
df = df.drop_duplicates()
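One caveat: `drop_duplicates()` compares entire rows, and the summary statistics suggest that `ID` is a running row number (minimum 1, maximum 10000, 10000 rows). If that holds, no two full rows can be identical and the call above removes nothing. A hedged sketch of checking for repeats while ignoring `ID`, again on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy frame; rows 0 and 2 differ only in ID.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Marriage_Type": ["Love", "Arranged", "Love", "Arranged"],
    "Age_at_Marriage": [23, 28, 23, 30],
})

# With a unique ID in every row, no two full rows are identical.
full_row_duplicates = int(toy.duplicated().sum())

# Ignoring ID, the third row repeats the values of the first.
value_columns = [col for col in toy.columns if col != "ID"]
value_duplicates = int(toy.duplicated(subset=value_columns).sum())

# Keep the first occurrence of each repeated record.
deduplicated = toy.drop_duplicates(subset=value_columns, keep="first")
print(full_row_duplicates, value_duplicates, len(deduplicated))
```

Whether such rows are real duplicates is a judgment call. Two different couples can share the same attributes, so dropping by a column subset on the real dataset could remove legitimate records.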

#### **Step 6: Data Cleaning & Transformation**

##### Convert categorical variables into numerical form

In [None]:
from sklearn.preprocessing import LabelEncoder