# Imports

In [1]:
#basic data analysis
import numpy as np
import pandas as pd

In [10]:
#import dataset
startup = pd.read_csv("C:/Users/barbe/OneDrive/DS/Projects/Startup Failure Prediction/startup_failure_prediction.csv")

# Initial Look

The first step in any project is to look at the dataset.

In [3]:
#look at first few instances
startup.head()

Unnamed: 0,Startup_Name,Industry,Startup_Age,Funding_Amount,Number_of_Founders,Founder_Experience,Employees_Count,Revenue,Burn_Rate,Market_Size,Business_Model,Product_Uniqueness_Score,Customer_Retention_Rate,Marketing_Expense,Startup_Status
0,Startup_1,Logistics,8,18328419,2,13,581,97866143,602731,Medium,B2B,2,79.61,987830,1
1,Startup_2,Education,3,39753708,3,16,529,36868744,820698,Large,B2C,3,32.47,599615,1
2,Startup_3,Healthcare,14,18073294,1,28,82,3478737,992205,Small,B2C,1,9.88,780730,1
3,Startup_4,E-commerce,5,19435653,4,14,234,80716899,536747,Medium,B2C,9,23.2,188588,1
4,Startup_5,Finance,14,4205797,4,17,960,53347246,555199,Medium,Hybrid,3,73.52,310892,1


In [5]:
#look at last few instances
startup.tail()

Unnamed: 0,Startup_Name,Industry,Startup_Age,Funding_Amount,Number_of_Founders,Founder_Experience,Employees_Count,Revenue,Burn_Rate,Market_Size,Business_Model,Product_Uniqueness_Score,Customer_Retention_Rate,Marketing_Expense,Startup_Status
4995,Startup_4996,Education,9,39609909,4,13,324,16090631,971733,Small,B2B,7,21.56,568646,1
4996,Startup_4997,AI/ML,7,32167694,3,17,673,72563900,148322,Large,B2C,5,89.58,466261,1
4997,Startup_4998,Logistics,13,38902583,4,14,253,66629971,487704,Medium,B2C,6,17.21,468514,1
4998,Startup_4999,Education,2,33410138,4,16,417,69293233,723237,Medium,B2B,9,39.35,134880,1
4999,Startup_5000,Healthcare,1,42246335,1,29,708,53330079,962039,Large,B2C,7,16.32,217733,1


In [6]:
#look at dimension of dataset
print(startup.shape)

(5000, 15)


Looking at the first and last few instances, here is what I've noted:
- The dataset consists of 5000 instances, 14 features, and 1 response variable (Startup_Status).
- The dataset is mainly composed of numerical values, some of which are on very different scales (for example, Number_of_Founders versus Revenue).
- There are a few categorical variables: Market_Size has a natural order (Small, Medium, Large), while Industry and Business_Model do not.
- The variable Startup_Name can be dropped, since it only identifies each row.
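
The distinction between ordered and unordered categorical variables matters later, when the columns are encoded. Below is a minimal sketch of one way to treat them differently. It uses a small toy frame with illustrative values rather than the real dataset, and the ordering Small < Medium < Large is my assumption about what Market_Size means.

```python
import pandas as pd

# Toy frame with illustrative values only; this is not the real dataset
toy = pd.DataFrame({
    "Market_Size": ["Medium", "Large", "Small", "Medium"],
    "Business_Model": ["B2B", "B2C", "B2C", "Hybrid"],
})

# Assumed ordering for the ordinal variable: Small < Medium < Large
size_order = pd.CategoricalDtype(["Small", "Medium", "Large"], ordered=True)
toy["Market_Size"] = toy["Market_Size"].astype(size_order)

# Ordinal variable: integer codes follow the declared category order
toy["Market_Size_Code"] = toy["Market_Size"].cat.codes

# Nominal variable: one-hot encoding avoids implying any order
encoded = pd.get_dummies(toy, columns=["Business_Model"])

print(encoded)
```

Whether integer codes or one-hot columns work better depends on the model used later, so this is only one possible choice.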

# Data Cleaning

#### Data Types

We'll begin by checking the data types.

In [7]:
#look at data types
print(startup.dtypes)

Startup_Name                 object
Industry                     object
Startup_Age                   int64
Funding_Amount                int64
Number_of_Founders            int64
Founder_Experience            int64
Employees_Count               int64
Revenue                       int64
Burn_Rate                     int64
Market_Size                  object
Business_Model               object
Product_Uniqueness_Score      int64
Customer_Retention_Rate     float64
Marketing_Expense             int64
Startup_Status                int64
dtype: object


We can convert the object columns to the category type:
- Category columns usually use less memory than object columns when values repeat often.
- Some ML libraries can handle category columns directly, whereas object columns generally need to be encoded first.
- It signals that the data is either nominal or ordinal.

In [8]:
#convert object variables to categorical variables
for col in startup.select_dtypes(include = ['object']).columns:
    startup[col] = startup[col].astype('category')

#look at data types
startup.dtypes

Startup_Name                category
Industry                    category
Startup_Age                    int64
Funding_Amount                 int64
Number_of_Founders             int64
Founder_Experience             int64
Employees_Count                int64
Revenue                        int64
Burn_Rate                      int64
Market_Size                 category
Business_Model              category
Product_Uniqueness_Score       int64
Customer_Retention_Rate      float64
Marketing_Expense              int64
Startup_Status                 int64
dtype: object
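
To illustrate the memory point, here is a small sketch comparing the two dtypes on a hypothetical column of 5000 values drawn from five repeated labels. The series is made up for the comparison and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical column: 5000 rows built from five repeated labels
labels = ["Logistics", "Education", "Healthcare", "E-commerce", "Finance"]
as_object = pd.Series(labels * 1000, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The saving comes from storing each distinct label once plus a small integer code per row. A column of mostly unique strings, such as Startup_Name, would see little or no benefit.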

#### Missing Values and Duplicates

In [9]:
#check for missing values
print('Check for missing values:')
print(startup.isna().sum(), '\n')

#check for duplicates
print('Check for duplicates: \n', startup.duplicated().sum())

Check for missing values:
Startup_Name                0
Industry                    0
Startup_Age                 0
Funding_Amount              0
Number_of_Founders          0
Founder_Experience          0
Employees_Count             0
Revenue                     0
Burn_Rate                   0
Market_Size                 0
Business_Model              0
Product_Uniqueness_Score    0
Customer_Retention_Rate     0
Marketing_Expense           0
Startup_Status              0
dtype: int64 

Check for duplicates: 
 0


There are no missing values and no duplicate rows.
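
Since the real dataset is clean, the output above doesn't show what these checks look like when something is wrong. The sketch below runs the same two calls on a made-up frame that contains one missing value and one repeated row; the values are purely illustrative.

```python
import pandas as pd

# Made-up frame: row 1 repeats row 0, and row 2 has a missing Industry
toy = pd.DataFrame({
    "Industry": ["Finance", "Finance", None, "AI/ML"],
    "Startup_Age": [14, 14, 3, 7],
})

# Count of missing values per column
missing_per_column = toy.isna().sum()

# Count of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
```

Note that `duplicated()` only flags rows where every column matches an earlier row, so near-duplicates would not be caught this way.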

#### Remove Unwanted Variables

Only one variable isn't needed: Startup_Name. It acts as a row identifier, which the pandas DataFrame index already provides.

In [12]:
#remove column Startup_Name
startup = startup.drop(columns=['Startup_Name'])

#look at dataset
startup.head()

Unnamed: 0,Industry,Startup_Age,Funding_Amount,Number_of_Founders,Founder_Experience,Employees_Count,Revenue,Burn_Rate,Market_Size,Business_Model,Product_Uniqueness_Score,Customer_Retention_Rate,Marketing_Expense,Startup_Status
0,Logistics,8,18328419,2,13,581,97866143,602731,Medium,B2B,2,79.61,987830,1
1,Education,3,39753708,3,16,529,36868744,820698,Large,B2C,3,32.47,599615,1
2,Healthcare,14,18073294,1,28,82,3478737,992205,Small,B2C,1,9.88,780730,1
3,E-commerce,5,19435653,4,14,234,80716899,536747,Medium,B2C,9,23.2,188588,1
4,Finance,14,4205797,4,17,960,53347246,555199,Medium,Hybrid,3,73.52,310892,1
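
Before dropping a column like this, it can be worth confirming that it really is just an identifier. The following is a minimal sketch on a toy frame with illustrative values. The uniqueness check and the `errors="ignore"` argument are additions of mine and not part of the cell above.

```python
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Startup_Name": ["Startup_1", "Startup_2", "Startup_3"],
    "Industry": ["Logistics", "Education", "Healthcare"],
    "Startup_Status": [1, 1, 1],
})

# A pure identifier column has exactly one distinct value per row
is_identifier = toy["Startup_Name"].is_unique

# errors="ignore" keeps the cell safe to re-run after the column is gone
if is_identifier:
    toy = toy.drop(columns=["Startup_Name"], errors="ignore")

print(toy.columns.tolist())
```

Without `errors="ignore"`, running the drop a second time raises a `KeyError` because the column no longer exists.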


#### Outliers

The next step is to check for outliers.