# Data Cleaning and State Mapping for Startup Locations

This notebook performs data cleaning and processing on a dataset of startups in India, contained in the file `Listofstartups.csv`. The steps involved include:

1. **Loading the Data**:
   - The data is loaded from a CSV file into a Pandas DataFrame using `pd.read_csv()`.

2. **Initial Exploration**:
   - `df.describe()` summarises each column (count, number of unique values, most frequent value and its frequency, since every column holds text), and `df.head()` shows the first five rows to reveal the dataset's structure.

3. **Splitting the `Location of company` Column**:
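   - The `Location of company` column is split into two new columns, `City` and `State`, using `str.split(',', expand=True, n=1)`. The split happens at the first comma only: the text before it goes to `City` and the text after it goes to `State`. Entries without a comma (for example `New Delhi`) end up with a missing `State`. Surrounding whitespace is later removed from both columns with `str.strip()`.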
   
4. **Handling Missing Data**:
   - Missing values in the `State` and `City` columns are counted using `isna().sum()` (67 and 2 rows respectively in this run), and the rows where `State` is missing are selected.
   - Cities with missing state values are resolved using an external JSON file (`indian_states_and_cities.json`), which the lookup code reads as a mapping from each state to the cities in that state.

5. **Mapping Cities to States**:
   - A custom function `map_city_to_state()` is defined to look up the state for a given city. It title-cases the city name, scans the `city_to_state` dictionary loaded from the JSON file (which, despite its name, is keyed by state), and returns the first state whose city list contains the name. It returns `None` when no match is found.

6. **Filling Missing State Values**:
   - For the rows where the `State` is missing, the notebook attempts to find the correct state by using the `map_city_to_state()` function and updates the `State` column with the found value.

7. **Final Check**:
   - After filling in the missing state values, `df.State.isna().sum()` is called again to count how many remain. In this run the count drops from 67 to 14, so 14 rows still have no state after the lookup.

8. **Saving the Cleaned Data**:
   - Finally, the cleaned DataFrame is saved to a new CSV file, `Cleaned_Listofstartups.csv`, for further use.

---

### Purpose:
This notebook demonstrates basic data cleaning techniques, including handling missing data, text manipulation, and mapping data using external resources (in this case, a JSON file of cities and states). The cleaned dataset can now be used for further analysis or machine learning tasks.


In [33]:
import numpy as np
import pandas as pd

In [34]:
df = pd.read_csv('Listofstartups.csv')

In [35]:
df.describe()

| | Incubation_Center | Name_of_startup | Location of company | Sector |
|---|---|---|---|---|
| count | 241 | 241 | 239 | 241 |
| unique | 44 | 238 | 66 | 165 |
| top | SIIC, IIT Kanpur | MedCuore Medical Solutions Pvt Ltd | Bengaluru, Karnataka | Healthcare |
| freq | 17 | 2 | 31 | 34 |


In [36]:
df.head()

| | Incubation_Center | Name_of_startup | Location of company | Sector |
|---|---|---|---|---|
| 0 | ABES Ghaziabad | Suryansh | New Delhi | EdTech |
| 1 | AIC Banasthali Vidyapith Foundation | Thinkpods Education Services Private Limited (... | Satara, Maharashtra | Ed Tech |
| 2 | AIC Banasthali Vidyapith Foundation | Inventiway Solutions Pvt.Ltd. | Mumbai, Maharashtra | HR Tech |
| 3 | AIC Banasthali Vidyapith Foundation | C2M Internet India Pvt. Ltd. | Lucknow, Uttar Pradesh | Retail Tech |
| 4 | AIC Pinnacle Entrepreneurship Forum | Wastinno | Pune, Maharashtra | agriculture |


In [37]:
df[['City', 'State']] = df['Location of company'].str.split(',', expand=True, n=1)

In [38]:
df.head()

| | Incubation_Center | Name_of_startup | Location of company | Sector | City | State |
|---|---|---|---|---|---|---|
| 0 | ABES Ghaziabad | Suryansh | New Delhi | EdTech | New Delhi | |
| 1 | AIC Banasthali Vidyapith Foundation | Thinkpods Education Services Private Limited (... | Satara, Maharashtra | Ed Tech | Satara | Maharashtra |
| 2 | AIC Banasthali Vidyapith Foundation | Inventiway Solutions Pvt.Ltd. | Mumbai, Maharashtra | HR Tech | Mumbai | Maharashtra |
| 3 | AIC Banasthali Vidyapith Foundation | C2M Internet India Pvt. Ltd. | Lucknow, Uttar Pradesh | Retail Tech | Lucknow | Uttar Pradesh |
| 4 | AIC Pinnacle Entrepreneurship Forum | Wastinno | Pune, Maharashtra | agriculture | Pune | Maharashtra |


In [39]:
df.State.isna().sum()

np.int64(67)

In [40]:
df.City.isna().sum()

np.int64(2)
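
The two counts above differ because of how the split treats its input. `City` is missing only for the 2 rows where `Location of company` itself is missing (its count is 239 of 241), while `State` is also missing for the remaining 65 rows whose location contains no comma. The following is a minimal sketch on a made-up three-row frame (`toy` is a hypothetical name and the rows are illustrative, not taken from the dataset) that reproduces both cases:

```python
import pandas as pd

# Illustrative rows only: one "City, State" value, one comma-less value, one missing value
toy = pd.DataFrame({"Location of company": ["Pune, Maharashtra", "New Delhi", None]})

# Split at the first comma only; expand=True returns one column per piece
toy[["City", "State"]] = toy["Location of company"].str.split(",", n=1, expand=True)

# Remove the space left after the comma (missing values stay missing)
toy["City"] = toy["City"].str.strip()
toy["State"] = toy["State"].str.strip()

# City is missing only for the missing location; State is also missing for "New Delhi"
print(toy[["City", "State"]].isna().sum())
```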

In [41]:
df.City = df.City.str.strip()
df.State = df.State.str.strip()

In [44]:
import json
with open('indian_states_and_cities.json') as f:
    city_to_state = json.load(f)

In [45]:
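def map_city_to_state(city):
    """Return the state whose city list contains `city`, or None if no state matches."""
    # Title-case strings so that e.g. 'new delhi' matches 'New Delhi'; non-strings (NaN) pass through
    city = city.title() if isinstance(city, str) else city
    # city_to_state is keyed by state; each value is expected to hold that state's cities
    for state, cities in city_to_state.items():
        if city in cities:
            return state
    return None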

In [46]:
missing_state = df[df['State'].isna()]

In [47]:
# Fill State only where the lookup finds a match; unmatched rows stay missing
for index, row in missing_state.iterrows():
    city = row['City']
    state = map_city_to_state(city)
    if state:
        df.at[index, 'State'] = state
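
The row-by-row loop above rescans every state's city list for each missing row. One possible alternative, sketched here on made-up data, is to invert the mapping once into a city-to-state dictionary and fill the gaps with `Series.map` and `fillna`. The names `state_to_cities`, `city_lookup`, and `toy` are hypothetical, and the small dictionary merely stands in for the JSON file, assuming it has the state-to-list-of-cities shape that `map_city_to_state()` relies on.

```python
import pandas as pd

# Hypothetical stand-in for the contents of indian_states_and_cities.json
state_to_cities = {
    "Delhi": ["New Delhi"],
    "Maharashtra": ["Mumbai", "Pune"],
}

# Invert once: each city becomes a key pointing at its state
city_lookup = {
    city: state
    for state, cities in state_to_cities.items()
    for city in cities
}

# Illustrative rows: a fillable gap, an already-filled row, an unknown city, a missing city
toy = pd.DataFrame({
    "City": ["new delhi", "Pune", "Atlantis", None],
    "State": [None, "Maharashtra", None, None],
})

# Title-case the city, look it up, and use the result only where State is missing
toy["State"] = toy["State"].fillna(toy["City"].str.title().map(city_lookup))

print(toy)
```

One behavioural difference to keep in mind: if a city name were listed under more than one state, the inverted dictionary keeps the last state encountered, whereas the loop in `map_city_to_state()` returns the first.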

In [48]:
df.State.isna().sum()

np.int64(14)
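
Since some rows still lack a state, a natural follow-up is to list which city values failed the lookup. The sketch below shows one way to do that on made-up data (`toy` and `unresolved` are hypothetical names, and the rows are illustrative rather than taken from the dataset); the same expression could be applied to `df`.

```python
import pandas as pd

# Illustrative rows: two unresolved rows sharing a city, one resolved row, one missing city
toy = pd.DataFrame({
    "City": ["Gurgaon", "Gurgaon", "Pune", None],
    "State": [None, None, "Maharashtra", None],
})

# Count the City values among rows whose State is still missing;
# dropna=False keeps rows where City itself is missing in the tally
unresolved = toy.loc[toy["State"].isna(), "City"].value_counts(dropna=False)

print(unresolved)
```

Cities that appear in this tally are candidates for a spelling fix or for an addition to the JSON file.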

In [49]:
df.head()

| | Incubation_Center | Name_of_startup | Location of company | Sector | City | State |
|---|---|---|---|---|---|---|
| 0 | ABES Ghaziabad | Suryansh | New Delhi | EdTech | New Delhi | Delhi |
| 1 | AIC Banasthali Vidyapith Foundation | Thinkpods Education Services Private Limited (... | Satara, Maharashtra | Ed Tech | Satara | Maharashtra |
| 2 | AIC Banasthali Vidyapith Foundation | Inventiway Solutions Pvt.Ltd. | Mumbai, Maharashtra | HR Tech | Mumbai | Maharashtra |
| 3 | AIC Banasthali Vidyapith Foundation | C2M Internet India Pvt. Ltd. | Lucknow, Uttar Pradesh | Retail Tech | Lucknow | Uttar Pradesh |
| 4 | AIC Pinnacle Entrepreneurship Forum | Wastinno | Pune, Maharashtra | agriculture | Pune | Maharashtra |


In [50]:
df.to_csv('Cleaned_Listofstartups.csv', index=False)