Inspired by the Kaggle post: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

1. Understand the problem
2. Univariate study
3. Multivariate study
4. Basic cleaning
5. Test assumptions

Download the data from https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world

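Then set the variable PATH_TO_DATA (in your environment or in a `.env` file, since it is read with `decouple.config`) to the location of the downloaded file.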

In [10]:

# Import libraries and data
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from decouple import config
import pandas as pd

PATH_TO_DATA = config("PATH_TO_DATA")
data = pd.read_csv(PATH_TO_DATA)

## 1. Understand the problem
Here we want to check our understanding of the raw data and clarify which questions we are trying to answer with them. The main questions are:
- What is the meaning of each variable/column in the real world?
- Can potentially interesting variables be generated from the existing ones?
- When and how were the data collected?
- Is the sample size large enough to generalize?
- What research/business questions can be tackled with these data?

Let's consider each of those aspects with the following first steps of data exploration.

In [17]:
data.info()
print("")
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Language         100 non-null    object 
 1   Total Speakers   100 non-null    int64  
 2   Native Speakers  96 non-null     float64
 3   Origin           100 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 3.2+ KB



|   | Language | Total Speakers | Native Speakers | Origin |
|---|---|---|---|---|
| 0 | English | 1132366680 | 379007140.0 | Indo-European |
| 1 | Mandarin Chinese | 1116596640 | 917868640.0 | Sino-Tibetan |
| 2 | Hindi | 615475540 | 341208640.0 | Indo-European |
| 3 | Spanish | 534335730 | 460093030.0 | Indo-European |
| 4 | French | 279821930 | 77177210.0 | Indo-European |


We identify 4 variables, 2 numerical and 2 categorical, for a total of 100 samples. The "Non-Null Count" column shows that very little is missing: only **Native Speakers** has missing values (4 out of 100). Regarding the meaning of each variable:
- **Language**: object type containing the English name of each language 
- **Total Speakers**: int type measuring the number of speakers of each language, which includes both native and non-native speakers
- **Native Speakers**: float type representing the number of people whose mother tongue is the language on the same row
- **Origin**: object type describing the [family origin group of each language](https://en.wikipedia.org/wiki/Language_family)
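The type and missing-value checks described above can be condensed into one small table. The following is a minimal sketch, not the notebook's own code: `summarize_columns` is a hypothetical helper, and `sample` only holds the five rows shown by `data.head()`, so it contains no missing values. Run on the full `data` frame, the `missing` column should report 4 for **Native Speakers**, in line with the `data.info()` output.

```python
import pandas as pd

# First five rows, copied from the data.head() output above.
sample = pd.DataFrame({
    "Language": ["English", "Mandarin Chinese", "Hindi", "Spanish", "French"],
    "Total Speakers": [1132366680, 1116596640, 615475540, 534335730, 279821930],
    "Native Speakers": [379007140.0, 917868640.0, 341208640.0, 460093030.0, 77177210.0],
    "Origin": ["Indo-European", "Sino-Tibetan", "Indo-European",
               "Indo-European", "Indo-European"],
})


def summarize_columns(df):
    """Hypothetical helper: one row per column with dtype and missing-value stats."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": 100 * df.isna().mean(),
    })


summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(data)` instead would apply the same check to the full dataset.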

Note that the simple transformation $Total\;Speakers - Native\;Speakers$ gives us the number of non-native speakers, which lets us extend the analysis. From the [metadata](https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world/metadata) in the original dataset description, we deduce that the data were measured in 2019 and extracted from the website [Ethnologue](https://www.ethnologue.com/). Knowing that [around 6,500 languages are spoken in the world in 2021](https://blog.busuu.com/most-spoken-languages-in-the-world/), we should recognize that this dataset covers only about $1.5\%$ ($100/6500$) of the spoken languages. Nevertheless, to gauge the relative importance of these 100 most spoken languages, we will normalize the speaker counts by the 2019 world population of [7.7 billion](https://www.un.org/development/desa/publications/world-population-prospects-2019-highlights.html).
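As a minimal sketch of the two derived variables, the snippet below works on the five rows shown by `data.head()`. The helper name `add_derived_columns` is hypothetical, and computing the percentage from **Total Speakers** (rather than from native speakers) is an assumption, since the text does not say which count is normalized. Rows with a missing **Native Speakers** value would get a missing **Non-Native Speakers** value as well.

```python
import pandas as pd

# First five rows, copied from the data.head() output above.
sample = pd.DataFrame({
    "Language": ["English", "Mandarin Chinese", "Hindi", "Spanish", "French"],
    "Total Speakers": [1132366680, 1116596640, 615475540, 534335730, 279821930],
    "Native Speakers": [379007140.0, 917868640.0, 341208640.0, 460093030.0, 77177210.0],
    "Origin": ["Indo-European", "Sino-Tibetan", "Indo-European",
               "Indo-European", "Indo-European"],
})

WORLD_POPULATION_2019 = 7.7e9  # UN figure cited in the text


def add_derived_columns(df, world_population=WORLD_POPULATION_2019):
    """Hypothetical helper: add non-native speakers and share of world population."""
    out = df.copy()
    out["Non-Native Speakers"] = out["Total Speakers"] - out["Native Speakers"]
    # Assumption: the share is based on total (native + non-native) speakers.
    out["Percentage of the world population"] = (
        100 * out["Total Speakers"] / world_population
    )
    return out


enriched = add_derived_columns(sample)
print(enriched[["Language", "Non-Native Speakers",
                "Percentage of the world population"]])

# Share of the world's languages covered by the dataset: 100 out of ~6,500.
share_of_languages = 100 * 100 / 6500
print(f"{share_of_languages:.1f}% of spoken languages")
```

The same call, `add_derived_columns(data)`, would extend the full dataset.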

The resulting dataset would then consist of:
- **Language**
- **Total Speakers**
- **Native Speakers**
- **Origin**
- **Non-Native Speakers**
- **Percentage of the world population**

Using this information, we are able to consider the following questions:
- What are the top 10 most spoken languages by native speakers in 2019? And by non-native speakers?
- What percentage of the world population speaks at least one of the top 10 languages, as a native or non-native speaker? What about the top 100 languages?
- What is the origin of the top 10 most spoken languages? 
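One possible way to approach these questions is sketched below on the five rows shown by `data.head()`, using a top 3 instead of a top 10. The helper `top_languages` is hypothetical. The share is computed from native speakers only, which is an assumption made here: summing total speakers across languages would count multilingual people more than once.

```python
import pandas as pd

# First five rows, copied from the data.head() output above.
sample = pd.DataFrame({
    "Language": ["English", "Mandarin Chinese", "Hindi", "Spanish", "French"],
    "Total Speakers": [1132366680, 1116596640, 615475540, 534335730, 279821930],
    "Native Speakers": [379007140.0, 917868640.0, 341208640.0, 460093030.0, 77177210.0],
    "Origin": ["Indo-European", "Sino-Tibetan", "Indo-European",
               "Indo-European", "Indo-European"],
})
sample["Non-Native Speakers"] = sample["Total Speakers"] - sample["Native Speakers"]

WORLD_POPULATION_2019 = 7.7e9  # UN figure cited in the text


def top_languages(df, column, n=10):
    """Hypothetical helper: the n rows with the largest values in `column`."""
    return df.nlargest(n, column)[["Language", column, "Origin"]]


# Ranking by native and by non-native speakers (top 3 on this small sample).
top_native = top_languages(sample, "Native Speakers", n=3)
top_non_native = top_languages(sample, "Non-Native Speakers", n=3)

# Language families represented among the top native-speaker languages.
origin_counts = top_native["Origin"].value_counts()

# Share of the 2019 world population speaking one of them as a mother tongue.
native_share = 100 * top_native["Native Speakers"].sum() / WORLD_POPULATION_2019

print(top_native)
print(top_non_native)
print(origin_counts)
print(f"{native_share:.1f}% of the world population")
```

On the full dataset, replacing `sample` with `data` and `n=3` with `n=10` would give the top-10 versions of the same tables.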


 