In [1]:
import pandas as pd

# dealing with corrupted values

## descriptions that are too short

In [2]:
df = pd.read_csv("../data/00_baseline/raw_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4246 entries, 0 to 4245
Data columns (total 4 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   startup_ID                   4246 non-null   int64 
 1   description_startupdetector  592 non-null    object
 2   startup_description          4112 non-null   object
 3   industry                     4246 non-null   object
dtypes: int64(1), object(3)
memory usage: 132.8+ KB


In [4]:
# merge descriptions as in prepare_dataset_csv.py
df['description'] = df['description_startupdetector'].fillna(df['startup_description'])
# sort by description length and look at the shortest ones
df['len_description'] = df['description'].str.len()
df_sort = df.sort_values('len_description', ascending=False)
df_sort.to_csv('../data/01_corrupted/further_data/full_sorted_length.csv', index=False)

From looking at the CSV file, we chose 47 characters as the minimum description length. Below this threshold, most descriptions seem to be a chain of keywords rather than a proper description.
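To sanity-check a cutoff like this, it can help to count how many rows fall below a few candidate thresholds and to read the descriptions that the chosen one would remove. The following is a minimal sketch on a toy DataFrame. The sample descriptions and the helper name `count_below` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the merged 'description' column (hypothetical values).
toy = pd.DataFrame({
    "description": [
        "unknown",
        "AI, SaaS, B2B",
        "FinTech, Payments, Mobile, Banking",
        "We build a platform that connects local farmers with restaurants.",
        "Our software helps hospitals schedule staff more efficiently.",
    ]
})
toy["len_description"] = toy["description"].str.len()


def count_below(frame: pd.DataFrame, threshold: int) -> int:
    """Return the number of rows whose description is shorter than threshold."""
    return int((frame["len_description"] < threshold).sum())


# Compare a few candidate thresholds around the chosen value of 47.
for threshold in (20, 47, 60):
    print(threshold, count_below(toy, threshold))

# Inspect the rows that the chosen threshold would remove.
print(toy.loc[toy["len_description"] < 47, "description"].tolist())
```

Reading the rows just above and just below each candidate value is what ultimately justifies the choice; the counts only show how much data each option would cost.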

In [5]:
# remove all rows with descriptions shorter than 47 characters
df.drop(df[df['len_description'] < 47].index, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3905 entries, 0 to 4245
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   startup_ID                   3905 non-null   int64 
 1   description_startupdetector  590 non-null    object
 2   startup_description          3772 non-null   object
 3   industry                     3905 non-null   object
 4   description                  3905 non-null   object
 5   len_description              3905 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 213.6+ KB


## removing missing values

Although there are no NaN values in the description column, we found descriptions like "unknown" or "no information" during data exploration. The goal is to find all synonyms for missing descriptions and remove them from the dataset. We will not remove incorrect descriptions in this step, since they are considered "corrupted data".

Most of them are already handled by the length threshold.

## found "missing" values

Found through string length:
- unknown
- unknow
- Unknown
- no infos
- no startup
- Placeholder
- no information
- no informations

Found through duplicate descriptions:
- GROW by Pioniergarage Team 2020 - No description available.

The entry found through duplicate descriptions is the only one long enough to pass the 47-character threshold. This is why we only remove rows with this description here.
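The two discovery routes above (very short strings and repeated descriptions) can be sketched as follows. This is an illustrative sketch on toy data, not the procedure actually used to build the list: the `PLACEHOLDERS` set simply restates the values listed above, and `is_placeholder` and `find_repeated` are hypothetical helpers.

```python
import pandas as pd

# Placeholder strings listed above, lower-cased for case-insensitive matching.
PLACEHOLDERS = {
    "unknown", "unknow", "no infos", "no startup",
    "placeholder", "no information", "no informations",
}


def is_placeholder(description: str) -> bool:
    """Return True if the description is one of the known placeholder strings."""
    return description.strip().lower() in PLACEHOLDERS


def find_repeated(descriptions: pd.Series, min_count: int = 2) -> pd.Series:
    """Return descriptions occurring at least min_count times, with their counts."""
    counts = descriptions.value_counts()
    return counts[counts >= min_count]


# Toy data (hypothetical rows, not from the dataset).
toy = pd.DataFrame({
    "description": [
        "Unknown",
        "no information",
        "GROW by Pioniergarage Team 2020 - No description available.",
        "GROW by Pioniergarage Team 2020 - No description available.",
        "We build a platform that connects local farmers with restaurants.",
    ]
})

print(toy["description"].apply(is_placeholder).tolist())
print(find_repeated(toy["description"]))
```

Repeated descriptions are only candidates: each one still needs a manual look to decide whether it is a placeholder or a legitimately shared text.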

In [6]:
missing_value = 'GROW by Pioniergarage Team 2020 - No description available.'

old_len = len(df)
# drop all rows with this description
df = df.drop(df[df['description'] == missing_value].index)

print(f'removed {old_len - len(df)} "missing" values')

removed 9 "missing" values


## Further errors detected through manual checking

In [8]:
man_error = pd.read_csv('../data/01_corrupted/further_data/data_errors_manual_check.csv')
# get startup_IDs of Data Error entries
# unmarked rows are NaN; treat them as False and make the mask an explicit boolean
man_error['Data Error'] = man_error['Data Error'].fillna(False).astype(bool)
man_error = man_error.loc[man_error['Data Error']]
man_err_ids = man_error['startup_ID'].tolist()
len(man_err_ids)

27

In [9]:
old_len = len(df)
# drop rows if startup_ID is in list
df = df.drop(df[df['startup_ID'].isin(man_err_ids)].index)
print(f'dropped {old_len - len(df)} corrupted rows')

dropped 9 corrupted rows
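Only 9 rows were dropped although 27 IDs were flagged. A plausible reason, which is an assumption here and not verified in this notebook, is that the other flagged rows had already been removed by the length threshold or the placeholder filter. A minimal sketch of how such a check could look if run before the drop, using toy data and a hypothetical helper `split_flagged_ids`:

```python
import pandas as pd


def split_flagged_ids(frame: pd.DataFrame, flagged_ids: list[int]) -> tuple[list[int], list[int]]:
    """Split flagged IDs into those still present in frame and those already gone."""
    present = set(frame["startup_ID"])
    still_there = [i for i in flagged_ids if i in present]
    already_gone = [i for i in flagged_ids if i not in present]
    return still_there, already_gone


# Toy data (hypothetical IDs): rows 2 and 5 were removed by an earlier step.
remaining = pd.DataFrame({"startup_ID": [1, 3, 4, 6]})
flagged = [2, 3, 5, 6]

still_there, already_gone = split_flagged_ids(remaining, flagged)
print("would be dropped now:", still_there)
print("removed earlier:", already_gone)
```

Applied to the real data, `frame` would be `df` and `flagged_ids` would be `man_err_ids`.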


In [22]:
# save new dataframe without corrupted rows
df = df.drop(columns=['description', 'len_description'])
df.to_csv('../data/01_corrupted/raw_data.csv', index=False)