# Part 2 - Section 5: Multiple Linear Regression

In [54]:
import pandas as pd

df = pd.read_csv('50_Startups.csv')
df.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


**Dataset description:** We are given (highly summarized) accounting information and the geographic location of 50 companies, without their names.

**Problem description:** We want to predict a company's profit from its accounting information and its geographic location.
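
As a rough sketch of what that prediction problem looks like as a formula, the multiple linear regression model can be written as below. The symbols $\beta_i$, $D_1$, $D_2$ and $\varepsilon$ are notation introduced here, not taken from the notebook, and the two dummy variables assume that `State` takes three values (the three visible in the preview), with one of them used as the baseline.

```latex
\text{Profit} = \beta_0
  + \beta_1 \,\text{R\&D Spend}
  + \beta_2 \,\text{Administration}
  + \beta_3 \,\text{Marketing Spend}
  + \beta_4 D_1 + \beta_5 D_2
  + \varepsilon
```

Here $D_1$ and $D_2$ are 0/1 indicators for two of the states, and $\varepsilon$ is the error term.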

In [55]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
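
To make the `iloc` split concrete, here is a minimal sketch that rebuilds only the five preview rows shown earlier as an in-memory CSV and applies the same slicing. The names `preview`, `X_preview` and `y_preview` are made up for this illustration, and the index column of the preview is left out.

```python
import io
import pandas as pd

# The five preview rows shown above, re-typed as an in-memory CSV
preview_csv = io.StringIO(
    "R&D Spend,Administration,Marketing Spend,State,Profit\n"
    "165349.2,136897.8,471784.1,New York,192261.83\n"
    "162597.7,151377.59,443898.53,California,191792.06\n"
    "153441.51,101145.55,407934.54,Florida,191050.39\n"
    "144372.41,118671.85,383199.62,New York,182901.99\n"
    "142107.34,91391.77,366168.42,Florida,166187.94\n"
)
preview = pd.read_csv(preview_csv)

X_preview = preview.iloc[:, :-1].values  # every column except the last one
y_preview = preview.iloc[:, -1].values   # only the last column (Profit)

print(X_preview.shape, y_preview.shape)
print(X_preview[0])
```

The feature matrix keeps the `State` text next to the numeric columns, which is why the categorical column has to be encoded before fitting a model.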

In [56]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# ColumnTransformer: one-hot encodes the categorical column (index 3) and passes the other columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')

X = np.array(ct.fit_transform(X))
# Drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]
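
The encode-then-drop step can also be expressed in a single call. The sketch below is one possible alternative, not the approach used in this notebook: it relies on the `drop='first'` option of `OneHotEncoder` and runs on a three-row toy array whose numeric values are invented for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values): one numeric column and one categorical column
toy = np.array([[1.0, 'New York'],
                [2.0, 'California'],
                [3.0, 'Florida']], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [1])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
encoded = np.array(ct_toy.fit_transform(toy))

# Three categories become two dummy columns, followed by the numeric column
print(encoded.shape)
print(encoded)
```

The dropped category (the first in sorted order, here `California`) becomes the baseline: its row has zeros in both dummy columns.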

In [57]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 0)
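
With 50 rows, `test_size=0.2` holds out 10 rows for testing and keeps 40 for training, and a fixed `random_state` makes the split repeatable. A small sketch on placeholder arrays (the data below is arbitrary, only the row count matches the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (50)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te_again))
```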

In [58]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X= X_train, y= y_train)
print('Model trained!')

Model trained!
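
Once fitted, a `LinearRegression` object exposes the estimated intercept and one coefficient per feature through `intercept_` and `coef_`. The sketch below shows this on noise-free toy data generated from a known formula, so the recovered values can be checked by eye. The data and the names `X_toy`, `y_toy` and `toy_model` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.intercept_)  # estimated intercept
print(toy_model.coef_)       # one coefficient per column of X_toy
```

The same two attributes are available on `regressor` after the `fit` call above.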


In [59]:
y_pred = regressor.predict(X_test)

pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

Unnamed: 0,y_test,y_pred
0,103282.38,103015.201598
1,144259.4,132582.277608
2,146121.95,132447.738452
3,77798.83,71976.098513
4,191050.39,178537.482211
5,105008.31,116161.242302
6,81229.06,67851.692097
7,97483.56,98791.733747
8,110352.25,113969.43533
9,166187.94,167921.065696
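
A side-by-side table is hard to judge by eye, so a summary metric helps. The sketch below copies the ten `y_test` / `y_pred` pairs from the table above and computes the mean absolute error and R² with `sklearn.metrics`. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Values copied from the comparison table above
y_true_tbl = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                       105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_pred_tbl = np.array([103015.201598, 132582.277608, 132447.738452, 71976.098513,
                       178537.482211, 116161.242302, 67851.692097, 98791.733747,
                       113969.43533, 167921.065696])

mae = mean_absolute_error(y_true_tbl, y_pred_tbl)  # average absolute error, in profit units
r2 = r2_score(y_true_tbl, y_pred_tbl)              # share of test-set variance explained

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

In the notebook itself, the same calls can be made directly on `y_test` and `y_pred`.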


## Building the optimal model with backward elimination

In [60]:
import statsmodels.api as sm

# Prepend a column of ones so that OLS estimates an intercept (statsmodels does not add one by default).
# astype(int) gives statsmodels a numeric array; note that it truncates the decimals.
X = np.append(arr=np.ones((X.shape[0], 1)), values=X.astype(int), axis=1)
X_opt = X.copy()

SL = 0.05  # Significance level
num_vars = X_opt.shape[1]
for i in range(num_vars):
    regression = sm.OLS(y, X_opt).fit()
    max_p_value = max(regression.pvalues)

    # Stop as soon as every remaining column is significant
    if max_p_value <= SL:
        break

    # Otherwise drop the column with the highest p-value and refit
    X_opt = np.delete(X_opt, np.argmax(regression.pvalues), axis=1)

final_regression = sm.OLS(y, X_opt).fit()
final_regression.summary()

Dep. Variable:,y,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Fri, 06 Jun 2025",Prob (F-statistic):,3.50e-32
Time:,17:21:58,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.900,19.320,0.000,4.39e+04,5.41e+04
x1,0.8543,0.029,29.151,0.000,0.795,0.913

Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.538
Skew:,-0.911,Prob(JB):,9.43e-05
Kurtosis:,5.361,Cond. No.,165000.0
