In [53]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## Kaggle datasets

In this notebook we look at a collection of datasets available on the Kaggle platform.

The data is published under the CC0 license on [this Kaggle page](https://www.kaggle.com/canggih/voted-kaggle-dataset) and has been adapted for use in class.

Get familiar with the contents of the notebook and follow the instructions to prepare the data needed in class.

**Note!** When you come back to the document later, remember to re-run the code cells.

### Dataset contents

The dataset contains information on:

*   **Title** - name of the dataset,
*   **Data Type** - format that the data is published in,
*   **License** - type of license regulating the use of the set,
*   **Votes** - number of votes for the set,
*   **Views** - number of views of the set,
*   **Downloads** - number of downloads of the set,
*   **Kernels** - number of kernels based on the set,
*   **Topics** - number of topics.

In [54]:
kaggle = pd.read_csv('./Preparation of the dataset/Kaggle Datasets/kaggle-ds.csv')
kaggle

| | Title | Data Type | License | Votes | Views | Downloads | Kernels | Topics |
|---|---|---|---|---|---|---|---|---|
| 0 | Credit Card Fraud Detection | CSV | ODbL | 1241 | 442136.0 | 53128.0 | 1782.0 | 26.0 |
| 1 | European Soccer Database | SQLite | ODbL | 1046 | 396214.0 | 46367.0 | 1459.0 | 75.0 |
| 2 | TMDB 5000 Movie Dataset | CSV | Other | 1024 | 446255.0 | 62002.0 | 1394.0 | 46.0 |
| 3 | Global Terrorism Database | CSV | Other | 789 | 187877.0 | 26309.0 | 608.0 | 11.0 |
| 4 | Bitcoin Historical Data | CSV | CC4 | 618 | 146734.0 | 16868.0 | 68.0 | 13.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2145 | Fortnite: Battle Royale Chest Location Coordin... | CSV | Other | 2 | 369.0 | 16.0 | NaN | 0.0 |
| 2146 | Titanic: Passenger Nationalities | CSV | Other | 2 | 75.0 | 10.0 | NaN | 0.0 |
| 2147 | Stemmed and Lementized English words | {}JSON | ODbL | 2 | 70.0 | 11.0 | NaN | 0.0 |
| 2148 | Los Angeles Weather During 2014 | CSV | CC0 | 2 | 78.0 | 5.0 | NaN | 0.0 |
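
After loading, it can be worth confirming that every column described in the list of dataset contents is actually present. The sketch below is only an illustration: `sample` is a hypothetical one-row frame standing in for `kaggle`, and `expected_columns` is a name introduced here, not part of the original notebook.

```python
import pandas as pd

# Column names taken from the "Dataset contents" list.
expected_columns = [
    "Title", "Data Type", "License", "Votes",
    "Views", "Downloads", "Kernels", "Topics",
]

# Hypothetical one-row frame standing in for the loaded file.
sample = pd.DataFrame([{
    "Title": "Example dataset", "Data Type": "CSV", "License": "CC0",
    "Votes": 2, "Views": 78.0, "Downloads": 5.0, "Kernels": None, "Topics": 0.0,
}])

# Any expected column that the frame does not contain.
missing = [name for name in expected_columns if name not in sample.columns]
print("missing columns:", missing)
```

On the real data, the same list comprehension would be run against `kaggle.columns`.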


## Notebook preparation

### Checking data types

In [55]:
kaggle.dtypes

Title         object
Data Type     object
License       object
Votes          int64
Views        float64
Downloads    float64
Kernels      float64
Topics       float64
dtype: object

In [56]:
"complete records: " + str(len(kaggle.dropna(how="any"))) + "; total records: " + str(len(kaggle))

'complete records: 845; total records: 2150'
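
To see which columns are responsible for the incomplete records, a per-column count of missing values can help. The sketch below uses a small made-up frame (`sample`, with invented values) so that it runs on its own; on the notebook's data the equivalent call would be `kaggle.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few of the numeric columns of `kaggle`.
sample = pd.DataFrame({
    "Votes": [10, 5, 2, 2],
    "Views": [100.0, 50.0, np.nan, 20.0],
    "Kernels": [3.0, np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows without any missing value (what dropna(how="any") keeps).
complete_rows = len(sample.dropna(how="any"))

print(missing_per_column)
print("complete rows:", complete_rows)
```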

### Converting dataset types to categories

In [57]:
kaggle["Data Type"].value_counts()

Data Type
CSV         1593
Other        468
{}JSON        56
SQLite        25
BigQuery       8
Name: count, dtype: int64

In [58]:
kaggle["Data Type"] = kaggle["Data Type"].astype("category")

### Converting licenses to categories

In [59]:
kaggle["License"].value_counts()

License
CC0      845
Other    751
CC4      324
ODbL     181
GPL       26
CC3       23
Name: count, dtype: int64

In [60]:
kaggle["License"] = kaggle["License"].astype("category")
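
A categorical column stores each distinct label once and represents the rows as small integer codes, which suits columns with only a handful of repeated values such as `Data Type` and `License`. The following sketch illustrates this on a made-up series (`licenses` is hypothetical data, not read from the file).

```python
import pandas as pd

# Made-up license labels, standing in for kaggle["License"].
licenses = pd.Series(["CC0", "Other", "CC0", "ODbL", "CC0", "Other"])

as_category = licenses.astype("category")

# Distinct labels, stored once (sorted by default for strings).
categories = list(as_category.cat.categories)

# Integer code per row, pointing into `categories`.
codes = list(as_category.cat.codes)

print(as_category.dtype)
print(categories)
print(codes)
```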

### Numeric variables analysis

In [61]:
kaggle.describe()

| | Votes | Views | Downloads | Kernels | Topics |
|---|---|---|---|---|---|
| count | 2150.0 | 2145.0 | 2135.0 | 1206.0 | 1558.0 |
| mean | 24.011628 | 7299.300699 | 923.793443 | 38.392206 | 1.261874 |
| std | 64.788465 | 22660.139843 | 3098.5846 | 147.499168 | 3.58914 |
| min | 2.0 | 29.0 | 0.0 | 2.0 | 0.0 |
| 25% | 4.0 | 750.0 | 60.0 | 3.0 | 0.0 |
| 50% | 8.0 | 1930.0 | 187.0 | 7.0 | 0.0 |
| 75% | 19.0 | 5151.0 | 602.0 | 21.0 | 2.0 |
| max | 1241.0 | 446255.0 | 62002.0 | 3394.0 | 75.0 |
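
In the summary above, the mean of `Votes` (about 24.0) is well above its median (8.0), and the maximum (1241) is far from the upper quartile (19), which points to a strongly right-skewed distribution. A minimal sketch of that comparison, using a made-up series built from the reported minimum, quartiles and maximum rather than the real column:

```python
import pandas as pd

# Made-up values: the min, quartiles and max of Votes from the summary table.
votes = pd.Series([2, 4, 8, 19, 1241])

mean_votes = votes.mean()
median_votes = votes.median()

# A mean far above the median suggests a long right tail.
skewed_right = mean_votes > median_votes

print("mean:", mean_votes, "median:", median_votes, "right-skewed:", skewed_right)
```

On the notebook's data the same comparison would use `kaggle["Votes"].mean()` and `kaggle["Votes"].median()`.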


## Exercises

**Exercise 1**

In [63]:
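# Bubble chart: downloads vs. views, marker size scaled by votes, colour by license.
fig = px.scatter(
    data_frame=kaggle,
    x="Downloads",
    y="Views",
    color="License",
    size="Votes",
    size_max=50,  # upper limit for the marker size
    color_discrete_sequence=px.colors.qualitative.Plotly,
    title="Downloads vs Views vs Votes",
    # Human-readable names for the axes, legend and hover text.
    labels={
        "Downloads": "Number of Downloads",
        "Views": "Number of Views",
        "Votes": "Number of Votes",
        "License": "License Type",
    },
)

# Outline each marker so that overlapping bubbles stay distinguishable.
fig.update_traces(
    marker_line_width=1,
    marker_line_color="black",
)

fig.show()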

**Exercise 2**

In [64]:
# First version: views vs. topics, marker size scaled by votes, colour by data type.
fig = px.scatter(
    data_frame=kaggle,
    x="Views",
    y="Topics",
    color="Data Type",
    size="Votes",
    size_max=40,
    title="Relationship Between Views, Topics and Votes on Kaggle Datasets",
    hover_data=["Title", "Data Type"],
)

fig.show()

In [65]:
# Refined version of the same chart with an explicit palette and layout tweaks.
fig = px.scatter(
    data_frame=kaggle,
    x="Views",
    y="Topics",
    color="Data Type",
    size="Votes",
    size_max=40,
    title="Relationship Between Views, Topics, and Votes",
    hover_data=["Title", "Data Type"],
    color_discrete_sequence=px.colors.qualitative.D3,
)

# Centre the title and set axis titles and font sizes.
fig.update_layout(
    title_x=0.5,
    title_font=dict(size=16),
    xaxis_title="Number of Views",
    yaxis_title="Number of Topics",
    font=dict(size=12),
)

fig.show()

In [66]:
# Preview the built-in qualitative colour sequences (e.g. Plotly, D3).
fig = px.colors.qualitative.swatches()
fig.show()