Data Exploration: Predicting Startup Success
In this notebook we explore the startup dataset (`startup data.csv`), which covers 923 US startups (technology companies across a range of industries) founded between roughly 2005 and 2013, with 49 columns describing each startup.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's default theme
sns.set_theme()

# Enable matplotlib's interactive mode so figures are drawn as they are created
plt.ion()

pd.set_option('display.max_columns', None)

In [2]:
# Load the dataset
df = pd.read_csv('../data/startup data.csv')

# Preview the first rows (only rendered when it is the last expression in the cell)
df.head()

# Column names, non-null counts, and dtypes
df.info()

# Summary statistics for the numeric columns
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 49 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                923 non-null    int64  
 1   state_code                923 non-null    object 
 2   latitude                  923 non-null    float64
 3   longitude                 923 non-null    float64
 4   zip_code                  923 non-null    object 
 5   id                        923 non-null    object 
 6   city                      923 non-null    object 
 7   Unnamed: 6                430 non-null    object 
 8   name                      923 non-null    object 
 9   labels                    923 non-null    int64  
 10  founded_at                923 non-null    object 
 11  closed_at                 335 non-null    object 
 12  first_funding_at          923 non-null    object 
 13  last_funding_at           923 non-null    object 
 14  age_first_

df.describe() output, transposed (one row per numeric column):

column                    count  mean         std          min          25%          50%          75%          max
Unnamed: 0                923.0  572.297941   333.585431   1.0          283.5        577.0        866.5        1153.0
latitude                  923.0  38.517442    3.741497     25.752358    37.388869    37.779281    40.730646    59.335232
longitude                 923.0  -103.539212  22.394167    -122.756956  -122.198732  -118.374037  -77.214731   18.057121
labels                    923.0  0.646804     0.478222     0.0          0.0          1.0          1.0          1.0
age_first_funding_year    923.0  2.23563      2.510449     -9.0466      0.5767       1.4466       3.57535      21.8959
age_last_funding_year     923.0  3.931456     2.96791      -9.0466      1.66985      3.5288       5.56025      21.8959
age_first_milestone_year  771.0  3.055353     2.977057     -14.1699     1.0          2.5205       4.6863       24.6849
age_last_milestone_year   771.0  4.754423     3.212107     -7.0055      2.411        4.4767       6.7534       24.6849
relationships             923.0  7.710726     7.265776     0.0          3.0          5.0          10.0         63.0
funding_rounds            923.0  2.310943     1.390922     1.0          1.0          2.0          3.0          10.0
funding_total_usd         923.0  25419750.0   189634400.0  11000.0      2725000.0    10000000.0   24725000.0   5700000000.0
milestones                923.0  1.84182      1.322632     0.0          1.0          2.0          3.0          8.0
is_CA                     923.0  0.527627     0.499507     0.0          0.0          1.0          1.0          1.0
is_NY                     923.0  0.114843     0.319005     0.0          0.0          0.0          0.0          1.0
is_MA                     923.0  0.089924     0.286228     0.0          0.0          0.0          0.0          1.0
is_TX                     923.0  0.045504     0.208519     0.0          0.0          0.0          0.0          1.0
is_otherstate             923.0  0.221018     0.415158     0.0          0.0          0.0          0.0          1.0
is_software               923.0  0.165764     0.37207      0.0          0.0          0.0          0.0          1.0
is_web                    923.0  0.156013     0.363064     0.0          0.0          0.0          0.0          1.0
is_mobile                 923.0  0.08559      0.27991      0.0          0.0          0.0          0.0          1.0
is_enterprise             923.0  0.07909      0.270025     0.0          0.0          0.0          0.0          1.0
is_advertising            923.0  0.067172     0.250456     0.0          0.0          0.0          0.0          1.0
is_gamesvideo             923.0  0.056338     0.230698     0.0          0.0          0.0          0.0          1.0
is_ecommerce              923.0  0.027086     0.162421     0.0          0.0          0.0          0.0          1.0
is_biotech                923.0  0.036836     0.188462     0.0          0.0          0.0          0.0          1.0
is_consulting             923.0  0.00325      0.056949     0.0          0.0          0.0          0.0          1.0
is_othercategory          923.0  0.32286      0.467823     0.0          0.0          0.0          1.0          1.0
has_VC                    923.0  0.326111     0.469042     0.0          0.0          0.0          1.0          1.0
has_angel                 923.0  0.254605     0.435875     0.0          0.0          0.0          1.0          1.0
has_roundA                923.0  0.508126     0.500205     0.0          0.0          1.0          1.0          1.0
has_roundB                923.0  0.392199     0.488505     0.0          0.0          0.0          1.0          1.0
has_roundC                923.0  0.232936     0.422931     0.0          0.0          0.0          0.0          1.0
has_roundD                923.0  0.099675     0.299729     0.0          0.0          0.0          0.0          1.0
avg_participants          923.0  2.838586     1.874601     1.0          1.5          2.5          3.8          16.0
is_top500                 923.0  0.809317     0.393052     0.0          1.0          1.0          1.0          1.0
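
The summary statistics above report negative minimums for all four age columns (for example, -9.0466 for age_first_funding_year), which means some funding or milestone events are dated before the founding date. A minimal sketch for counting such rows follows. The helper name `count_negative_ages` is hypothetical, and the toy rows are made up for illustration only; on the real data you would pass `df` instead.

```python
import pandas as pd

# Age columns as they appear in the df.describe() output.
AGE_COLUMNS = [
    "age_first_funding_year",
    "age_last_funding_year",
    "age_first_milestone_year",
    "age_last_milestone_year",
]

def count_negative_ages(frame, columns=AGE_COLUMNS):
    """Return, per column, how many rows hold a negative age (NaN is not counted)."""
    present = [c for c in columns if c in frame.columns]
    # Comparisons against NaN evaluate to False, so missing values are ignored.
    return (frame[present] < 0).sum()

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "age_first_funding_year": [2.2, -9.0, 0.5],
    "age_last_funding_year": [3.0, -9.0, 1.7],
    "age_first_milestone_year": [4.7, None, -1.0],
    "age_last_milestone_year": [6.7, None, 2.4],
})
negative_counts = count_negative_ages(toy)
print(negative_counts)
```

Whether these rows should be dropped, clipped to zero, or kept as a signal is a modeling decision that the counts alone do not settle.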


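The df.info() listing shows founded_at, closed_at, first_funding_at, and last_funding_at stored as object (string) columns. One possible way to convert them to datetimes is sketched below. The helper `parse_date_columns` is hypothetical, the toy rows are invented, and the date format actually used in the CSV is not visible in this notebook, so the ISO strings here are an assumption.

```python
import pandas as pd

# Date-like columns listed as dtype object in df.info().
DATE_COLUMNS = ["founded_at", "closed_at", "first_funding_at", "last_funding_at"]

def parse_date_columns(frame, columns=DATE_COLUMNS):
    """Return a copy of frame with the given columns converted to datetimes."""
    out = frame.copy()
    for col in columns:
        if col in out.columns:
            # errors="coerce" turns unparseable or missing entries into NaT
            # instead of raising an exception.
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "founded_at": ["2007-01-01", "2005-06-15"],
    "closed_at": [None, "2012-03-01"],
})
parsed = parse_date_columns(toy)
print(parsed.dtypes)
```

After conversion, date arithmetic such as `parsed["closed_at"] - parsed["founded_at"]` works directly, and missing closing dates stay missing as NaT.
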
Data quality analysis

In [3]:
# Count null values per column
df.isnull().sum()

# Percentage of null values per column
round(100 * (df.isnull().sum() / len(df)), 2)



Unnamed: 0                   0.00
state_code                   0.00
latitude                     0.00
longitude                    0.00
zip_code                     0.00
id                           0.00
city                         0.00
Unnamed: 6                  53.41
name                         0.00
labels                       0.00
founded_at                   0.00
closed_at                   63.71
first_funding_at             0.00
last_funding_at              0.00
age_first_funding_year       0.00
age_last_funding_year        0.00
age_first_milestone_year    16.47
age_last_milestone_year     16.47
relationships                0.00
funding_rounds               0.00
funding_total_usd            0.00
milestones                   0.00
state_code.1                 0.11
is_CA                        0.00
is_NY                        0.00
is_MA                        0.00
is_TX                        0.00
is_otherstate                0.00
category_code                0.00
is_software   
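
In the listing above only a few columns have missing values (closed_at at 63.71%, Unnamed: 6 at 53.41%, the two milestone age columns at 16.47%, and state_code.1 at 0.11%). A compact report that keeps only those columns can be easier to scan. The sketch below is one way to build it; `missing_report` is a hypothetical helper and the toy rows are invented for illustration.

```python
import pandas as pd

def missing_report(frame):
    """Summarize columns with at least one null: count and percentage, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(frame)).round(2),
    })
    # Keep only the columns that actually contain nulls.
    report = report[report["n_missing"] > 0]
    return report.sort_values("pct_missing", ascending=False)

# Toy rows for illustration only; these are NOT values from the dataset.
toy = pd.DataFrame({
    "closed_at": [None, "2012-03-01", None, None],
    "age_first_milestone_year": [1.0, None, 2.5, 4.0],
    "labels": [1, 0, 1, 1],
})
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)` would be the equivalent call.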

In [None]:
# Now