# Phase 1: Data Loading & Inspection

**Project:** Predictive Analysis of Indian Startup Funding (2015-2020)  
**Author:** Rohit & Team  
**Date:** November 2025

---

## Objective
Load the raw startup funding dataset and perform initial inspection to understand data structure, quality, and potential issues.

## Input
- **File:** `../data/raw/startup_funding.csv`
- **Source:** Indian Startup Funding Dataset (2015-2020)

## Expected Output
- Dataset overview (shape, columns, data types)
- Missing value analysis
- Initial observations for cleaning phase

## Tasks
1. Import necessary libraries
2. Load CSV data
3. Display basic information
4. Check for missing values and data quality issues
5. Document observations

---

## Step 1: Import Libraries

In [2]:
# Import core libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')

print("Libraries imported successfully")

Libraries imported successfully


## Step 2: Load Dataset

Loading the raw CSV file from `../data/raw/startup_funding.csv`

In [3]:
# Load the dataset
df = pd.read_csv('../data/raw/startup_funding.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")

Dataset loaded successfully!
Shape: 3044 rows × 10 columns
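A quick structural check right after loading can catch a wrong file or a changed schema early. The sketch below is illustrative only: `load_and_check` and `EXPECTED_COLUMNS` are hypothetical names, and a two-row in-memory CSV stands in for the real file. The column spellings (including the double space in `City  Location`) are copied from this notebook's own code and output.

```python
import io
import pandas as pd

# Columns the later cells rely on (spelling copied from this notebook's code and output)
EXPECTED_COLUMNS = ["Industry Vertical", "City  Location", "InvestmentnType", "Amount in USD"]


def load_and_check(source, expected=EXPECTED_COLUMNS):
    """Read a CSV and fail early if an expected column is missing (hypothetical helper)."""
    frame = pd.read_csv(source)
    missing = [col for col in expected if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Two-row stand-in for ../data/raw/startup_funding.csv, for illustration only
sample = io.StringIO(
    "Sr No,Industry Vertical,City  Location,InvestmentnType,Amount in USD\n"
    "1,E-Tech,Bengaluru,Private Equity Round,200000000\n"
    "2,Transportation,Gurgaon,Series C,8048394\n"
)
sample_df = load_and_check(sample)
print(f"Shape: {sample_df.shape[0]} rows × {sample_df.shape[1]} columns")
```

With the real file, the same helper would be called with the path instead of the in-memory buffer.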


## Step 3: Initial Data Inspection

Let's examine the first few rows and understand the structure

In [4]:
# Display first 10 rows
df.head(10)

|   | Sr No | Date dd/mm/yyyy | Startup Name | Industry Vertical | SubVertical | City Location | Investors Name | InvestmentnType | Amount in USD | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 09/01/2020 | BYJU’S | E-Tech | E-learning | Bengaluru | Tiger Global Management | Private Equity Round | 200000000 |  |
| 1 | 2 | 13/01/2020 | Shuttl | Transportation | App based shuttle service | Gurgaon | Susquehanna Growth Equity | Series C | 8048394 |  |
| 2 | 3 | 09/01/2020 | Mamaearth | E-commerce | Retailer of baby and toddler products | Bengaluru | Sequoia Capital India | Series B | 18358860 |  |
| 3 | 4 | 02/01/2020 | https://www.wealthbucket.in/ | FinTech | Online Investment | New Delhi | Vinod Khatumal | Pre-series A | 3000000 |  |
| 4 | 5 | 02/01/2020 | Fashor | Fashion and Apparel | Embroiled Clothes For Women | Mumbai | Sprout Venture Partners | Seed Round | 1800000 |  |
| 5 | 6 | 13/01/2020 | Pando | Logistics | Open-market, freight management platform | Chennai | Chiratae Ventures | Series A | 9000000 |  |
| 6 | 7 | 10/01/2020 | Zomato | Hospitality | Online Food Delivery Platform | Gurgaon | Ant Financial | Private Equity Round | 150000000 |  |
| 7 | 8 | 12/12/2019 | Ecozen | Technology | Agritech | Pune | Sathguru Catalyzer Advisors | Series A | 6000000 |  |
| 8 | 9 | 06/12/2019 | CarDekho | E-Commerce | Automobile | Gurgaon | Ping An Global Voyager Fund | Series D | 70000000 |  |
| 9 | 10 | 03/12/2019 | Dhruva Space | Aerospace | Satellite Communication | Bengaluru | Mumbai Angels, Ravikanth Reddy | Seed | 50000000 |  |


In [5]:
df.describe()

|       | Sr No |
|-------|-------|
| count | 3044.0 |
| mean | 1522.5 |
| std | 878.871435 |
| min | 1.0 |
| 25% | 761.75 |
| 50% | 1522.5 |
| 75% | 2283.25 |
| max | 3044.0 |
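By default `describe()` summarizes only numeric columns, which is why `Sr No` is the only column in this output; the amount column is evidently not stored as a number at this stage. A minimal sketch on a hypothetical three-row frame (not the real dataset) shows the behavior and how `include="all"` brings text columns into the summary:

```python
import pandas as pd

# Hypothetical stand-in: the amount is held as text, as in the raw file
toy = pd.DataFrame({
    "Sr No": [1, 2, 3],
    "Amount in USD": ["200000000", "8048394", None],
})

# The default summary covers numeric columns only
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```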


## Key Observations

### Data Structure Issues Found:
1. **Column Name Typo:** `InvestmentnType` should be `InvestmentType`
2. **Amount Column Mislabeled:** "Amount in USD" contains INR values in Indian comma format
3. **Missing Stage Column:** Need to extract funding stage from `InvestmentnType`
4. **Date Format:** Date is in dd/mm/yyyy string format, needs parsing
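The column-name problems can be handled before any value cleaning. The sketch below is one possible approach, not the project's final cleaning code: `tidy_name` is a hypothetical helper, and the empty frame only carries column names. It collapses repeated whitespace (the notebook's own code refers to `City  Location` with two spaces) and fixes the `InvestmentnType` typo.

```python
import re
import pandas as pd

# Empty frame carrying a few of the raw column names, for illustration only
toy = pd.DataFrame(columns=["Sr No", "City  Location", "InvestmentnType", "Amount in USD"])


def tidy_name(name):
    """Collapse runs of whitespace to one space and trim the ends (hypothetical helper)."""
    return re.sub(r"\s+", " ", name).strip()


cleaned = (
    toy.rename(columns=tidy_name)                            # "City  Location" -> "City Location"
       .rename(columns={"InvestmentnType": "InvestmentType"})  # fix the typo
)
print(list(cleaned.columns))
```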

### Missing Data Summary:
- **Remarks:** 86.24% missing (not critical for analysis)
- **Amount in USD:** 31.54% missing (requires handling)
- **SubVertical:** 30.75% missing (can be filled or dropped)
- **City  Location:** 5.91% missing
- **Industry Vertical:** 5.62% missing
- **Investors Name:** 0.79% missing
- **InvestmentnType:** 0.13% missing (4 rows; map to "Undisclosed")
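One way these decisions might translate into code is sketched here on a hypothetical three-row frame. Dropping `Remarks`, filling the investment type with "Undisclosed", and flagging missing amounts rather than imputing them are assumptions for illustration; the actual strategy belongs to the cleaning phase.

```python
import pandas as pd

# Hypothetical rows with the kinds of gaps seen in the raw data
toy = pd.DataFrame({
    "InvestmentnType": ["Seed Funding", None, "Series A"],
    "Amount in USD": ["1800000", None, "9000000"],
    "Remarks": [None, None, None],
})

# Remarks is mostly empty and not needed for analysis, so drop it
handled = toy.drop(columns=["Remarks"]).copy()

# A missing investment type becomes an explicit category
handled["InvestmentnType"] = handled["InvestmentnType"].fillna("Undisclosed")

# Keep missing amounts as missing, but record which rows they are
handled["amount_missing"] = handled["Amount in USD"].isna()

print(handled)
```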

### Next Steps for Cleaning Phase:
- Parse dates with `dayfirst=True`
- Clean amount column (remove commas, convert to numeric)
- Extract and normalize funding stages
- Standardize city names (Bangalore → Bengaluru)
- Count investors from comma-separated list
- Handle missing values appropriately
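The steps listed here can be sketched on a small hypothetical frame. Everything below is an assumption for illustration rather than the Phase 2 implementation: the comma-formatted amounts are made-up renderings of values from the preview, `normalize_stage` is a hypothetical helper, and the stage handling only collapses spacing variants and escaped line breaks.

```python
import re
import pandas as pd

# Hypothetical rows; amounts are illustrative comma-formatted strings
toy = pd.DataFrame({
    "Date dd/mm/yyyy": ["09/01/2020", "13/01/2020", "03/12/2019"],
    "City  Location": ["Bangalore", "Gurgaon", "Bengaluru"],
    "Investors Name": ["Tiger Global Management", "Susquehanna Growth Equity",
                       "Mumbai Angels, Ravikanth Reddy"],
    "InvestmentnType": ["Seed/ Angel Funding", "Seed\\\\nFunding", None],
    "Amount in USD": ["20,00,00,000", "80,48,394", None],
})


def normalize_stage(value):
    """Collapse spacing variants of a stage label (hypothetical helper)."""
    if pd.isna(value):
        return "Undisclosed"
    text = re.sub(r"\\+n", " ", str(value))   # escaped line breaks such as "\\n"
    text = re.sub(r"\s*/\s*", "/", text)      # "Seed/ Angel" -> "Seed/Angel"
    return re.sub(r"\s+", " ", text).strip()


clean = pd.DataFrame({
    # dd/mm/yyyy strings: day comes first
    "date": pd.to_datetime(toy["Date dd/mm/yyyy"], dayfirst=True, errors="coerce"),
    # strip thousands separators, then convert; unparseable values become NaN
    "amount": pd.to_numeric(
        toy["Amount in USD"].str.replace(",", "", regex=False), errors="coerce"
    ),
    "city": toy["City  Location"].replace({"Bangalore": "Bengaluru"}),
    "stage": [normalize_stage(v) for v in toy["InvestmentnType"]],
    # number of comma-separated investor names per row
    "investor_count": toy["Investors Name"].str.split(",").str.len(),
})
print(clean)
```

Using `errors="coerce"` keeps malformed dates and amounts as missing values instead of stopping the run, which makes them easy to count afterwards.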

---

**Status:** Data loading complete. Ready for Phase 2 (Cleaning).

## Step 4: Missing Value Analysis

Identify columns with missing data and calculate percentages

In [7]:
# Calculate missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})

# Sort by missing percentage
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print("Missing Value Summary:\n")
print(missing_data.to_string(index=False))

Missing Value Summary:

           Column  Missing_Count  Missing_Percentage
          Remarks           2625               86.24
    Amount in USD            960               31.54
      SubVertical            936               30.75
   City  Location            180                5.91
Industry Vertical            171                5.62
   Investors Name             24                0.79
  InvestmentnType              4                0.13


## Step 5: Column Analysis

Examine unique values in key categorical columns

In [6]:
# Check unique values for key columns
print("Unique Value Counts:\n")
print(f"Industry Vertical: {df['Industry Vertical'].nunique()} unique values")
print(f"City Location: {df['City  Location'].nunique()} unique cities")
print(f"Investment Type: {df['InvestmentnType'].nunique()} unique types")
print(f"\n Sample Investment Types:")
print(df['InvestmentnType'].value_counts().head(15))

Unique Value Counts:

Industry Vertical: 821 unique values
City Location: 112 unique cities
Investment Type: 55 unique types

 Sample Investment Types:
InvestmentnType
Private Equity          1356
Seed Funding            1355
Seed/ Angel Funding       60
Seed / Angel Funding      47
Seed\\nFunding            30
Debt Funding              25
Series A                  24
Seed/Angel Funding        23
Series B                  20
Series C                  14
Series D                  12
Angel / Seed Funding       8
Seed Round                 7
Seed                       4
Private Equity Round       4
Name: count, dtype: int64


### Dataset Information

Check column names, data types, and memory usage

In [None]:
# Column info
df.info()

### Statistical Summary

Numerical column distributions (see the `df.describe()` output in Step 3)