# Data Mining - Exploratory Data Analysis 🔎
**Authors:** [Melissa Perez](https://github.com/MelissaPerez09), [Adrian Flores](https://github.com/adrianRFlores), [Andrea Ramirez](https://github.com/Andrea-gt)

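**Description:** Exploratory data analysis of the `movies.csv` dataset (10,000 movies), starting with data cleaning, descriptive statistics, and a classification of variable types.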

### Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import scipy.stats as stats
import statsmodels.stats.diagnostic as diag

### Data Cleaning
The cell below preprocesses `movies_df` in two steps:

1. Convert the string values in the numeric columns `castMenAmount` and `castWomenAmount` to numbers; entries that cannot be parsed become NaN (Not a Number).

2. Replace unrealistically high values (above 1000) in `castMenAmount` and `castWomenAmount` with NaN, since they most likely come from data-entry errors.

In [4]:
# Read the data from the 'movies.csv' file and store it in a DataFrame named 'movies_df'.
movies_df = pd.read_csv('Data/movies.csv', encoding='unicode_escape')

# Convert string values in numeric columns to actual numbers. If conversion fails, replace with NaN.
movies_df[['castMenAmount', 'castWomenAmount']] = movies_df[['castMenAmount', 'castWomenAmount']].apply(pd.to_numeric, errors='coerce')

# Replace values above 1000 in 'castMenAmount' and 'castWomenAmount' with NaN,
# as these values likely represent errors or outliers.
cast_cols = ['castMenAmount', 'castWomenAmount']
movies_df[cast_cols] = movies_df[cast_cols].mask(movies_df[cast_cols] > 1000)

# Display the DataFrame to visualize the changes made.
movies_df

Unnamed: 0,id,budget,genres,homePage,productionCompany,productionCompanyCountry,productionCountry,revenue,runtime,video,...,popularity,releaseDate,voteAvg,voteCount,genresAmount,productionCoAmount,productionCountriesAmount,actorsAmount,castWomenAmount,castMenAmount
0,5,4000000,Crime|Comedy,https://www.miramax.com/movie/four-rooms/,Miramax|A Band Apart,US|US,United States of America,4257354.0,98,False,...,20.880,1995-12-09,5.7,2077,2,2,1,25,15.0,9.0
1,6,21000000,Action|Thriller|Crime,,Universal Pictures|Largo Entertainment|JVC,US|US|JP,Japan|United States of America,12136938.0,110,False,...,9.596,1993-10-15,6.5,223,3,3,2,15,3.0,9.0
2,11,11000000,Adventure|Action|Science Fiction,http://www.starwars.com/films/star-wars-episod...,Lucasfilm|20th Century Fox,US|US,United States of America,775398007.0,121,,...,100.003,1977-05-25,8.2,16598,3,2,1,105,5.0,62.0
3,12,94000000,Animation|Family,http://movies.disney.com/finding-nemo,Pixar,US,United States of America,940335536.0,100,,...,134.435,2003-05-30,7.8,15928,2,1,1,24,5.0,18.0
4,13,55000000,Comedy|Drama|Romance,,Paramount|The Steve Tisch Company,US|,United States of America,677387716.0,142,False,...,58.751,1994-07-06,8.5,22045,3,2,1,76,18.0,48.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,920081,0,Action|Horror,,,,,0.0,100,False,...,16.662,2021-11-26,6.8,108,2,1,1,10,2.0,4.0
9996,920143,0,Comedy,,Caracol Televisión|Dago García Producciones,CO|CO,Colombia,0.0,97,False,...,491.706,2021-12-25,1.5,2,1,2,1,8,1.0,1.0
9997,922017,0,Comedy,,,,Nigeria,0.0,112,False,...,565.658,2021-12-17,6.1,30,1,1,17,1,0.0,
9998,922162,0,,https://www.netflix.com/title/81425229,,,United States of America,0.0,59,False,...,9.664,2021-12-17,6.0,1,1,0,0,0,,
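
The two cleaning steps can be checked in isolation on a small frame. The sketch below uses made-up values in a hypothetical `toy_df` (not rows from `movies.csv`) and assumes the same 1000 threshold as the cell above. It shows that an unparseable string and an implausibly large count both end up as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movies.csv (values are made up).
toy_df = pd.DataFrame({
    'castMenAmount': ['9', '62', 'unknown', '250000'],
    'castWomenAmount': ['15', '5', '18', '3'],
})
cast_cols = ['castMenAmount', 'castWomenAmount']

# Step 1: strings become numbers; unparseable entries become NaN.
toy_df[cast_cols] = toy_df[cast_cols].apply(pd.to_numeric, errors='coerce')

# Step 2: values above the 1000 threshold are replaced with NaN.
toy_df[cast_cols] = toy_df[cast_cols].mask(toy_df[cast_cols] > 1000)

# Count how many entries ended up missing in each column.
missing_per_column = toy_df[cast_cols].isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` on `movies_df` after cleaning would show how many cast counts were discarded by these two steps.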


### Exercise I - Data Description
This section gives a quick overview of the dataset by computing descriptive statistics for its quantitative columns.

In [5]:
# Create a new DataFrame 'movies_df_describe' excluding the 'id' column.
movies_df_describe = movies_df.loc[:, movies_df.columns != 'id']

# Generate descriptive statistics for only the quantitative columns.
movies_df_describe.describe(include=[np.number])

| Statistic | budget | revenue | runtime | popularity | voteAvg | voteCount | genresAmount | productionCoAmount | productionCountriesAmount | actorsAmount | castWomenAmount | castMenAmount |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9838.0 | 9478.0 |
| mean | 18551630.0 | 56737930.0 | 100.2681 | 51.393907 | 6.48349 | 1342.3818 | 2.5965 | 3.1714 | 1.751 | 2147.6666 | 7.148201 | 14.119434 |
| std | 36626690.0 | 149585400.0 | 27.777829 | 216.729552 | 0.984274 | 2564.196637 | 1.154565 | 2.539738 | 3.012093 | 37200.075802 | 6.281767 | 13.131693 |
| min | 0.0 | 0.0 | 0.0 | 4.258 | 1.3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 90.0 | 14.57775 | 5.9 | 120.0 | 2.0 | 2.0 | 1.0 | 13.0 | 3.0 | 6.0 |
| 50% | 500000.0 | 163124.5 | 100.0 | 21.9055 | 6.5 | 415.0 | 3.0 | 3.0 | 1.0 | 21.0 | 6.0 | 11.0 |
| 75% | 20000000.0 | 44796610.0 | 113.0 | 40.654 | 7.2 | 1316.0 | 3.0 | 4.0 | 2.0 | 36.0 | 9.0 | 19.0 |
| max | 380000000.0 | 2847246000.0 | 750.0 | 11474.647 | 10.0 | 30788.0 | 16.0 | 89.0 | 155.0 | 919590.0 | 106.0 | 626.0 |
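
In the summary, `count` is below 10000 for `castWomenAmount` (9838) and `castMenAmount` (9478) because `describe()` only tallies non-null entries. The sketch below illustrates this on a hypothetical `sample_df` with made-up values. It also measures the share of zeros in `budget`, on the assumption (not confirmed by the source) that a zero there may stand for an unreported figure, which would explain a 25% quartile of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real notebook would use movies_df instead.
sample_df = pd.DataFrame({
    'budget': [4000000, 0, 11000000, 0],
    'castWomenAmount': [15.0, np.nan, 5.0, 5.0],
    'castMenAmount': [9.0, np.nan, np.nan, 18.0],
})

summary = sample_df.describe(include=[np.number])

# 'count' only tallies non-null entries, so rows minus count = missing values.
missing = len(sample_df) - summary.loc['count']

# Zeros are counted as valid observations, which pulls quartiles toward 0.
zero_share = (sample_df['budget'] == 0).mean()

print(missing)
print(zero_share)
```

Applied to `movies_df`, the same subtraction would give 162 missing values for `castWomenAmount` and 522 for `castMenAmount`, matching the `count` row above.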


### Exercise II - Variable Types


| Field                               | Variable Type           |
|:-----------------------------------:|:-----------------------:|
| popularity                          | Quantitative Continuous |
| budget                              | Quantitative Continuous |
| revenue                             | Quantitative Continuous |
| original_title                      | Qualitative Nominal     |
| original_language                   | Qualitative Nominal     |
| title                               | Qualitative Nominal     |
| home_page                           | Qualitative Nominal     |
| video                               | Qualitative Nominal     |
| director                            | Qualitative Nominal     |
| runtime                             | Quantitative Continuous |
| genres                              | Qualitative Nominal     |
| number_of_genres                    | Quantitative Discrete   |
| production_company                  | Qualitative Nominal     |
| number_of_productions               | Quantitative Discrete   |
| production_company_country          | Qualitative Nominal     |
| production_country                  | Qualitative Nominal     |
| number_of_production_countries      | Quantitative Discrete   |
| release_date                        | Qualitative Ordinal     |
| vote_count                          | Quantitative Discrete   |
| vote_average                        | Quantitative Continuous |
| actors                              | Qualitative Nominal     |
| actors_popularity                   | Quantitative Continuous |
| actors_character                    | Qualitative Nominal     |
| number_of_actors                    | Quantitative Discrete   |
| number_of_women_in_cast             | Quantitative Discrete   |
| number_of_men_in_cast               | Quantitative Discrete   |
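
A first pass at this classification can be made from the column dtypes. The sketch below is only an approximation under stated assumptions: `sample_df` is a hypothetical frame with a few columns named as in the `movies_df` output, and numeric dtype is used as a stand-in for "quantitative". The dtype alone cannot tell discrete from continuous variables, so that distinction still has to be made by hand, as in the table.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the movies_df output above.
sample_df = pd.DataFrame({
    'genres': ['Crime|Comedy', 'Animation|Family'],
    'budget': [4000000, 94000000],
    'voteAvg': [5.7, 7.8],
    'genresAmount': [2, 2],
    'releaseDate': ['1995-12-09', '2003-05-30'],
})

# Numeric dtypes are a first approximation of the quantitative variables.
quantitative_cols = sample_df.select_dtypes(include='number').columns.tolist()

# Everything else is treated as qualitative in this sketch.
qualitative_cols = sample_df.select_dtypes(exclude='number').columns.tolist()

# Dates can be parsed so that they sort chronologically.
sample_df['releaseDate'] = pd.to_datetime(sample_df['releaseDate'])

print(quantitative_cols)
print(qualitative_cols)
```

Note that `budget` is stored as an integer here even though the table treats it as continuous, which is one reason the dtype split should be read as a starting point only.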