# Challenge 1 - Basic

## Visualization of Large Databases

### Eyder Uriel Kinil Cervera - Student code 216910473


This challenge introduces students to visualizing large databases with PySpark and Koalas, an API that makes it easier to work with large volumes of data in a Spark environment while keeping a pandas-like syntax.

-- Koalas is replaced here by pyspark.pandas

pyspark.pandas (formerly known as Koalas) is an API for working with distributed data through an interface very similar to pandas. It is a good fit when you need the scalability of Spark but prefer the syntax and capabilities of pandas.

## Objective

Work with the Koalas API to run basic data analysis operations and simple visualizations using PySpark (Matplotlib, Plotly, Seaborn, PyCaret).

In [16]:
# Import libraries
import findspark
import pyspark
import pyspark.pandas as ps
import seaborn as sns
import matplotlib.pyplot as plt

In [17]:
# Path to the CSV file
file_path = "../globalterrorismdb_0718dist.csv"

In [25]:
# Load the CSV data with pyspark.pandas
psdf = ps.read_csv(file_path)



In [26]:
# Check the size of the dataset using the pyspark.pandas DataFrame
psdf.shape

(181691, 135)

### Notes

Scalability: pyspark.pandas handles large databases. For very large ones, consider using PySpark directly with spark.read.csv.

Conversion to a Spark DataFrame: done to enable distributed operations, using the following calls.
- spark_df = df.to_spark()
- spark_df.printSchema()

In [27]:
# Convert to a Spark DataFrame
spark_df = psdf.to_spark()

# Show the schema of the Spark DataFrame
spark_df.printSchema()

root
 |-- eventid: long (nullable = true)
 |-- iyear: integer (nullable = true)
 |-- imonth: integer (nullable = true)
 |-- iday: integer (nullable = true)
 |-- approxdate: string (nullable = true)
 |-- extended: integer (nullable = true)
 |-- resolution: string (nullable = true)
 |-- country: integer (nullable = true)
 |-- country_txt: string (nullable = true)
 |-- region: integer (nullable = true)
 |-- region_txt: string (nullable = true)
 |-- provstate: string (nullable = true)
 |-- city: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- specificity: integer (nullable = true)
 |-- vicinity: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- crit1: string (nullable = true)
 |-- crit2: string (nullable = true)
 |-- crit3: string (nullable = true)
 |-- doubtterr: string (nullable = true)
 |-- alternative: string (nullable = true)
 |-- alternative_txt: string (nullable



In [21]:
# Inspect the structure of the DataFrame
print(spark_df.dtypes)

[('eventid', 'bigint'), ('iyear', 'int'), ('imonth', 'int'), ('iday', 'int'), ('approxdate', 'string'), ('extended', 'int'), ('resolution', 'string'), ('country', 'int'), ('country_txt', 'string'), ('region', 'int'), ('region_txt', 'string'), ('provstate', 'string'), ('city', 'string'), ('latitude', 'double'), ('longitude', 'double'), ('specificity', 'int'), ('vicinity', 'int'), ('location', 'string'), ('summary', 'string'), ('crit1', 'string'), ('crit2', 'string'), ('crit3', 'string'), ('doubtterr', 'string'), ('alternative', 'string'), ('alternative_txt', 'string'), ('multiple', 'string'), ('success', 'string'), ('suicide', 'string'), ('attacktype1', 'string'), ('attacktype1_txt', 'string'), ('attacktype2', 'string'), ('attacktype2_txt', 'string'), ('attacktype3', 'string'), ('attacktype3_txt', 'string'), ('targtype1', 'string'), ('targtype1_txt', 'string'), ('targsubtype1', 'string'), ('targsubtype1_txt', 'string'), ('corp1', 'string'), ('target1', 'string'), ('natlty1', 'string'), 

### Why use pyspark.pandas?

- Scalability: It lets you work with distributed data that does not fit in memory by building on Spark.

- Compatibility: The API is almost identical to pandas, so it is easy to pick up for anyone who already knows pandas.

- Easy conversion: It simplifies converting between pyspark.pandas and standard PySpark DataFrames.
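The compatibility point can be illustrated with a function that uses only calls shared by both libraries. The sketch below runs on plain pandas so that it needs no Spark session. Swapping the import for `import pyspark.pandas as pd` is the assumed change to run it on Spark. The function name `attacks_per_region` and the sample rows are hypothetical.

```python
import pandas as pd  # assumption: `import pyspark.pandas as pd` would also work here


def attacks_per_region(df):
    """Count rows per region using calls available in pandas and pyspark.pandas."""
    counts = df.groupby("region_txt").size()
    return counts.sort_values(ascending=False)


# Made-up rows that reuse two column names from the dataset.
sample = pd.DataFrame(
    {
        "iyear": [1970, 1970, 1970, 1970],
        "region_txt": [
            "Western Europe",
            "East Asia",
            "Western Europe",
            "North America",
        ],
    }
)

counts = attacks_per_region(sample)
print(counts)
```

Keeping analysis code to this shared subset means the same function can be tried on a small pandas sample before it is pointed at the full distributed DataFrame.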

### Working with PySpark using pandas syntax

In [28]:
# Check the first rows
psdf.head(5)

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,region_txt,provstate,city,latitude,longitude,specificity,vicinity,location,summary,crit1,crit2,crit3,doubtterr,alternative,alternative_txt,multiple,success,suicide,attacktype1,attacktype1_txt,attacktype2,attacktype2_txt,attacktype3,attacktype3_txt,targtype1,targtype1_txt,targsubtype1,targsubtype1_txt,corp1,target1,natlty1,natlty1_txt,targtype2,targtype2_txt,targsubtype2,targsubtype2_txt,corp2,target2,natlty2,natlty2_txt,targtype3,targtype3_txt,targsubtype3,targsubtype3_txt,corp3,target3,natlty3,natlty3_txt,gname,gsubname,gname2,gsubname2,gname3,gsubname3,motive,guncertain1,guncertain2,guncertain3,individual,nperps,nperpcap,claimed,claimmode,claimmode_txt,claim2,claimmode2,claimmode2_txt,claim3,claimmode3,claimmode3_txt,compclaim,weaptype1,weaptype1_txt,weapsubtype1,weapsubtype1_txt,weaptype2,weaptype2_txt,weapsubtype2,weapsubtype2_txt,weaptype3,weaptype3_txt,weapsubtype3,weapsubtype3_txt,weaptype4,weaptype4_txt,weapsubtype4,weapsubtype4_txt,weapdetail,nkill,nkillus,nkillter,nwound,nwoundus,nwoundte,property,propextent,propextent_txt,propvalue,propcomment,ishostkid,nhostkid,nhostkidus,nhours,ndays,divert,kidhijcountry,ransom,ransomamt,ransomamtus,ransompaid,ransompaidus,ransomnote,hostkidoutcome,hostkidoutcome_txt,nreleased,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,Central America & Caribbean,,Santo Domingo,18.456792,-69.951164,1,0,,,1,1,1,0,,,0,1,0,1,Assassination,,,,,14,Private Citizens & Property,68,Named Civilian,,Julio Guzman,58,Dominican Republic,,,,,,,,,,,,,,,,,MANO-D,,,,,,,0,,,0,,,,,,,,,,,,,13,Unknown,,,,,,,,,,,,,,,,1.0,,,0.0,,,0,,,,,0,,,,,,,0,,,,,,,,,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,North America,Federal,Mexico city,19.371887,-99.086624,1,0,,,1,1,1,0,,,0,1,0,6,Hostage Taking (Kidnapping),,,,,7,Government (Diplomatic),45,"Diplomatic Personnel (outside of embassy, cons...",Belgian Ambassador Daughter,"Nadine Chaval, daughter",21,Belgium,,,,,,,,,,,,,,,,,23rd of September Communist League,,,,,,,0,,,0,7.0,,,,,,,,,,,,13,Unknown,,,,,,,,,,,,,,,,0.0,,,0.0,,,0,,,,,1,1.0,0.0,,,,Mexico,1,800000.0,,,,,,,,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,Southeast Asia,Tarlac,Unknown,15.478598,120.599741,4,0,,,1,1,1,0,,,0,1,0,1,Assassination,,,,,10,Journalists & Media,54,Radio Journalist/Staff/Facility,Voice of America,Employee,217,United States,,,,,,,,,,,,,,,,,Unknown,,,,,,,0,,,0,,,,,,,,,,,,,13,Unknown,,,,,,,,,,,,,,,,1.0,,,0.0,,,0,,,,,0,,,,,,,0,,,,,,,,,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,Western Europe,Attica,Athens,37.99749,23.762728,1,0,,,1,1,1,0,,,0,1,0,3,Bombing/Explosion,,,,,7,Government (Diplomatic),46,Embassy/Consulate,,U.S. Embassy,217,United States,,,,,,,,,,,,,,,,,Unknown,,,,,,,0,,,0,,,,,,,,,,,,,6,Explosives,16.0,Unknown Explosive Type,,,,,,,,,,,,,Explosive,,,,,,,1,,,,,0,,,,,,,0,,,,,,,,,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,East Asia,Fukouka,Fukouka,33.580412,130.396361,1,0,,,1,1,1,-9,,,0,1,0,7,Facility/Infrastructure Attack,,,,,7,Government (Diplomatic),46,Embassy/Consulate,,U.S. Consulate,217,United States,,,,,,,,,,,,,,,,,Unknown,,,,,,,0,,,0,,,,,,,,,,,,,8,Incendiary,,,,,,,,,,,,,,,Incendiary,,,,,,,1,,,,,0,,,,,,,0,,,,,,,,,,,,,PGIS,-9,-9,1,1,


In [None]:
# Check the data types
psdf.dtypes

eventid        int64
iyear          int32
imonth         int32
iday           int32
approxdate    object
               ...  
INT_LOG       object
INT_IDEO      object
INT_MISC      object
INT_ANY       object
related       object
Length: 135, dtype: object

In [36]:
# Descriptive statistics of the DataFrame
print(psdf.describe())

            eventid          iyear         imonth           iday       extended        country         region       latitude     longitude    specificity       vicinity
count  1.816910e+05  181691.000000  181691.000000  181691.000000  181691.000000  181691.000000  181691.000000  177135.000000  1.771340e+05  181685.000000  181691.000000
mean   2.002705e+11    2002.638997       6.467277      15.505644       0.045346     131.968501       7.160938      23.498343 -4.586957e+02       1.451452       0.068297
std    1.325957e+09      13.259430       3.388303       8.814045       0.208063     112.414535       2.933408      18.569242  2.047790e+05       0.995430       0.284553
min    1.970000e+11    1970.000000       0.000000       0.000000       0.000000       4.000000       1.000000     -53.154613 -8.618590e+07       1.000000      -9.000000
25%    1.991021e+11    1991.000000       4.000000       8.000000       0.000000      78.000000       5.000000      11.509748  4.481776e+00       1.000000  