In this section, we will modify the CSV to obtain a version that meets our needs.

Let's start by removing the columns that are not relevant to us.

In [61]:
# Imports

import pandas as pd


In [62]:
# Original File
df = pd.read_csv("Start-Up_initial.csv")

df.drop(columns=["permalink", "name", "homepage_url", "region", "state_code", "city"], inplace=True)

df.to_csv("Start_Up.csv", index=False)
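
As an alternative to dropping columns after loading, the unwanted columns can be skipped while the file is read. The sketch below uses the `usecols` parameter of `pd.read_csv` with a callable. The CSV text and its values are made up for illustration, and only the list of unwanted column names comes from the cell above.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for "Start-Up_initial.csv"
csv_text = (
    "name,city,status,funding_rounds\n"
    "Acme,Paris,operating,2\n"
    "Beta,Lyon,closed,1\n"
)

# Same columns as those dropped in the cell above
unwanted = {"permalink", "name", "homepage_url", "region", "state_code", "city"}

# usecols accepts a callable that is evaluated on each column name;
# only columns for which it returns True are loaded
toy = pd.read_csv(io.StringIO(csv_text), usecols=lambda col: col not in unwanted)

print(list(toy.columns))
```

Both approaches should give the same columns. Filtering at read time simply avoids loading data that is discarded right away.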


Now that we have only the necessary columns, we will also modify them to make them usable.

### **Column Overview:**
- **category_list**: Contains the industry sector of the start-up.  
- **funding_total_usd**: Contains the total value of funds raised.  
- **status**: Indicates the status of the start-up.  
- **country_code**: The country where the start-up is established.  
- **funding_rounds**: The number of funding rounds.  
- **founded_at**: The date of creation.  
- **first_funding_at**: The date of the first funding round.  
- **last_funding_at**: The date of the last funding round.  

We will now process these columns to ensure they are correctly formatted and usable.

The **category_list** column contains a large number of categories, with multiple categories assigned to some start-ups. To ensure a coherent analysis, we will:  

1. **Reduce the number of categories per start-up to just one.**  
2. **Simplify and consolidate categories to reduce overall category count.**  

This will allow for a clearer and more structured analysis of the data. Let's proceed with these modifications.

In [63]:
# Reload the file saved in the previous step
df = pd.read_csv("Start_Up.csv")

# Create one row per (start-up, category) pair so that we can count how often each category occurs
df_exploded = df.assign(category_list=df["category_list"].str.split("|")).explode("category_list")

# Fill in missing categories
df_exploded["category_list"] = df_exploded["category_list"].fillna("unknown").astype(str)

df["category_list"] = df["category_list"].fillna("unknown").astype(str)

category_counts = df_exploded["category_list"].value_counts()

# Keep only the most frequent category for each start-up
df["category_list"] = df["category_list"].apply(
    lambda x: max(x.split("|"), key=lambda cat: category_counts.get(cat, 0))
)

# Start-ups whose category appears fewer than 100 times in the dataset are grouped under "Other"
seuil = 100  # threshold

category_counts = df["category_list"].value_counts()

df["category_list"] = df["category_list"].apply(lambda x: x if category_counts[x] >= seuil else "Other")
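
To make the two steps easier to follow, here is a minimal sketch of the same logic on a toy table. The data and the threshold of 2 are invented for illustration, and the name `threshold` stands in for `seuil`. The second step uses `Series.where` instead of `apply`, which is a stylistic choice and not the approach of the cell above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "category_list": [
        "Software|Games",
        "Software",
        "Software",
        "Mobile|Games",
        "Mobile",
        "Mobile|Biotechnology",
        "Biotechnology",
        None,
    ]
})

# Count how often each individual category appears across all start-ups
exploded = toy["category_list"].str.split("|").explode().fillna("unknown")
counts = exploded.value_counts()

# Step 1: keep, for each start-up, its most frequent category
toy["category_list"] = toy["category_list"].fillna("unknown").apply(
    lambda cats: max(cats.split("|"), key=lambda cat: counts.get(cat, 0))
)

# Step 2: group categories below the threshold under "Other"
threshold = 2
final_counts = toy["category_list"].value_counts()
is_frequent = toy["category_list"].map(final_counts) >= threshold
toy["category_list"] = toy["category_list"].where(is_frequent, "Other")

print(toy["category_list"].tolist())
```

Note that when two categories of a start-up are equally frequent, `max` keeps the one listed first, so the result depends on the order of categories in the original string.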



In [64]:
print("#of categories", df["category_list"].nunique())

print(df["category_list"].value_counts())

#of categories 66
category_list
Software           8768
Mobile             4989
Other              4943
Biotechnology      4544
E-Commerce         3769
                   ... 
Cloud Computing     121
Nonprofits          118
Nanotechnology      111
Digital Media       103
Recruiting          103
Name: count, Length: 66, dtype: int64


We now have a total of 66 categories, each containing at least 100 start-ups. This column is now usable for analysis and modeling.

All the other columns can be used as they are.
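
A claim like this can be checked programmatically. The helper below is a hypothetical sketch (the name `rare_categories` and the toy data are not part of the notebook): it lists every label, apart from the catch-all one, that occurs fewer times than the threshold. An empty list would mean the claim holds.

```python
import pandas as pd


def rare_categories(categories: pd.Series, threshold: int, other_label: str = "Other") -> list:
    """Return the labels (excluding other_label) that occur fewer than `threshold` times."""
    counts = categories.value_counts()
    rare = counts[(counts < threshold) & (counts.index != other_label)]
    return sorted(rare.index)


# Toy data (hypothetical): "Games" appears once, below a threshold of 2
toy_categories = pd.Series(["Software"] * 3 + ["Mobile"] * 3 + ["Other", "Games"])

print(rare_categories(toy_categories, threshold=2))
```

On the real data, the equivalent call would presumably be `rare_categories(df["category_list"], threshold=100)`.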

In [65]:
# index=False keeps the row index out of the file, as in the earlier save
df.to_csv("final_Start-Up.csv", index=False)