## **PPG7_s1 - Analyzing the factors affecting car prices**
**Objective**
In this activity, you will analyze a dataset containing information about different cars. Your goal is to determine which factors influence car prices using **Linear Regression** and **ANOVA**.

**Dataset description**
The dataset contains the following variables:

1. **Price** (`Precio`, dependent variable) – The market price of the car in dollars.
2. **Mileage** (`Kilometraje`, continuous independent variable) – The number of kilometers the car has traveled.
3. **Year** (`Año`, continuous independent variable) – The year the car was manufactured.
4. **Brand** (`Marca`, categorical independent variable) – The car brand (Toyota, Ford, BMW, etc.).



**Tasks**
**1. Linear regression analysis**
- Fit a **multiple linear regression** model to predict the price of a car from its **mileage** (`Kilometraje`) and **year** (`Año`).
- Interpret the regression coefficients:
  - Does an increase in **mileage** decrease the price?
  - How does the **year** of manufacture affect the price?
- Analyze the **p-values** of the coefficients:
  - Are these variables statistically significant? (p-value < 0.05)



In [8]:
import pandas as pd
import statsmodels.api as sm

In [5]:
df = pd.read_csv('Dataset_de_Autos.csv')
print(df.head(5))

         Precio    Kilometraje   Año      Marca
0  27483.570765  176700.883666  2021  Chevrolet
1  24308.678494  150746.037373  2005       Ford
2  28238.442691  142432.990789  2010      Honda
3  32615.149282  143471.975958  2010   Mercedes
4  23829.233126   78303.318732  2016       Ford


In [6]:
# Define the independent variables (Kilometraje and Año) and the dependent variable (Precio)
X = df[['Kilometraje', 'Año']]
y = df['Precio']

In [9]:
# Add a constant (intercept) term for the regression
X = sm.add_constant(X)

In [10]:
# Fit the multiple linear regression model
modelo = sm.OLS(y, X).fit()

In [11]:
# Show the model summary
print(modelo.summary())

                            OLS Regression Results                            
Dep. Variable:                 Precio   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                 -0.006
Method:                 Least Squares   F-statistic:                    0.3599
Date:                Mon, 10 Mar 2025   Prob (F-statistic):              0.698
Time:                        10:54:48   Log-Likelihood:                -1972.1
No. Observations:                 200   AIC:                             3950.
Df Residuals:                     197   BIC:                             3960.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const       -5.804e+04   1.22e+05     -0.477      

The kilometraje coefficient is negative, which suggests that cars with higher mileage sell for less.

The year coefficient is positive, which suggests that newer cars are more expensive than older ones.

Neither effect is statistically significant, however. The overall F-test (F = 0.3599, p = 0.698) and R-squared = 0.004 show that the model explains almost none of the variation in price. With two predictors, each coefficient's squared t-statistic is at most 2F ≈ 0.72, so both p-values are well above 0.05 and the signs should be read as directions only.

**2. ANOVA analysis**
- Perform an **ANOVA** test to check if the **brand** of a car has a significant impact on its price.
- State the null and alternative hypotheses:
  - **H₀:** All brands have the same mean price.
  - **Hₐ:** At least one brand's mean price differs from the others.
- If the **p-value** is less than 0.05, reject H₀ and conclude that brand influences car price.



In [12]:
import statsmodels.formula.api as smf

In [13]:
# Run an ANOVA to analyze the impact of brand (Marca) on price (Precio)
modelo_anova = smf.ols("Precio ~ C(Marca)", data=df).fit()
anova_resultado = sm.stats.anova_lm(modelo_anova, typ=2)


In [14]:
# Show the ANOVA results
print("\nANOVA results:")
print(anova_resultado)


ANOVA results:
                sum_sq     df         F    PR(>F)
C(Marca)  5.741567e+07    5.0  0.523585  0.758257
Residual  4.254757e+09  194.0       NaN       NaN


In [17]:
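# Extract the p-value for the brand factor (label-based lookup avoids
# deprecated positional indexing on a labeled Series)
p_value = anova_resultado.loc["C(Marca)", "PR(>F)"]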

print(p_value)

0.7582566382413112


In [16]:
# Conclusion based on the p-value
if p_value < 0.05:
    print("\nConclusion: The car's brand significantly influences the price (p < 0.05).")
else:
    print("\nConclusion: There is not enough evidence to say that brand affects the price (p >= 0.05).")


Conclusion: There is not enough evidence to say that brand affects the price (p >= 0.05).


With F = 0.52 and p = 0.758, we fail to reject H₀: there is not enough evidence that the car's brand affects its price.
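To make the F-statistic less of a black box, the following sketch computes a one-way ANOVA by hand: F is the between-group mean square divided by the within-group mean square. The function `one_way_anova_f` and the tiny `precios_demo` dictionary are hypothetical illustrations, not part of the notebook. The last lines recompute F from the sums of squares printed in the ANOVA table above.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy prices (in thousands); NOT taken from the dataset
precios_demo = {
    "Ford": [20.0, 22.0, 24.0],
    "Honda": [21.0, 23.0, 25.0],
    "Toyota": [22.0, 24.0, 26.0],
}
f_demo, df_b, df_w = one_way_anova_f(precios_demo)
print(f_demo, df_b, df_w)  # 0.75 2 6

# Same ratio using the sums of squares and degrees of freedom from the ANOVA table
f_notebook = (5.741567e07 / 5) / (4.254757e09 / 194)
print(round(f_notebook, 4))  # 0.5236
```

An F below 1 means the brand means vary less than would be expected from the spread of prices within each brand, which is consistent with the large p-value.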

**Deliverables**
- A summary of your findings, including regression coefficients, p-values, and ANOVA results.
- A brief interpretation of the results:
  - Which factors are most influential in determining car prices?
  - Does brand significantly affect price?
  - How reliable is the regression model?
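
For the reliability question, one quick check is to reproduce the adjusted R-squared from the values in the OLS summary. The helper `adjusted_r2` below is a hypothetical name, and the formula assumes a model with an intercept. It shows why a tiny R-squared of 0.004 turns negative once the two predictors are penalized.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a linear model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the OLS summary: R-squared 0.004, 200 observations, 2 predictors
adj = adjusted_r2(0.004, 200, 2)
print(round(adj, 3))  # -0.006
```

A negative adjusted R-squared indicates that the predictors explain less variation than would be expected by chance from adding two arbitrary variables.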



**Bonus questions**
- What other factors might influence car prices that are not included in this dataset?
- If you were to improve this analysis, what additional data would you collect?