Data used: "Adult Data"

Analyze demographic data using Pandas. You are given a dataset of demographic data extracted from the 1994 Census database. Here is a sample of what the data looks like:

```
|    |   age | workclass        |   fnlwgt | education   |   education-num | marital-status     | occupation        | relationship   | race   | sex    |   capital-gain |   capital-loss |   hours-per-week | native-country   | salary   |
|---:|------:|:-----------------|---------:|:------------|----------------:|:-------------------|:------------------|:---------------|:-------|:-------|---------------:|---------------:|-----------------:|:-----------------|:---------|
|  0 |    39 | State-gov        |    77516 | Bachelors   |              13 | Never-married      | Adm-clerical      | Not-in-family  | White  | Male   |           2174 |              0 |               40 | United-States    | <=50K    |
|  1 |    50 | Self-emp-not-inc |    83311 | Bachelors   |              13 | Married-civ-spouse | Exec-managerial   | Husband        | White  | Male   |              0 |              0 |               13 | United-States    | <=50K    |
|  2 |    38 | Private          |   215646 | HS-grad     |               9 | Divorced           | Handlers-cleaners | Not-in-family  | White  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  3 |    53 | Private          |   234721 | 11th        |               7 | Married-civ-spouse | Handlers-cleaners | Husband        | Black  | Male   |              0 |              0 |               40 | United-States    | <=50K    |
|  4 |    28 | Private          |   338409 | Bachelors   |              13 | Married-civ-spouse | Prof-specialty    | Wife           | Black  | Female |              0 |              0 |               40 | Cuba             | <=50K    |
```
You must use Pandas to answer the following questions:
* How many people of each race are represented in this dataset?
* What is the average age of men?
* What percentage of people have a Bachelor's degree?
* What percentage of people with advanced education (Bachelor's, Master's, or Doctorate) earn more than 50K?
* What percentage of people without advanced education earn more than 50K?
* What is the minimum number of hours a person works per week?
* What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
* Which country has the highest percentage of people who earn >50K, and what is that percentage?
* Identify the most popular occupation for those who earn >50K in India.

In [2]:
import pandas as pd

In [3]:
def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()

    # What is the average age of men?
    average_age_men = round(df.loc[df['sex'] == 'Male','age'].mean(),1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round((df['education'] == 'Bachelors').mean() * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # Boolean masks built once and reused: advanced education, and salary >50K
    has_advanced_education = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
    earns_over_50k = df['salary'] == '>50K'

    # share of people with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = round(has_advanced_education.mean() * 100, 1)
    lower_education = round((~has_advanced_education).mean() * 100, 1)

    # percentage with salary >50K within each education group
    higher_education_rich = round(earns_over_50k[has_advanced_education].mean() * 100, 1)
    lower_education_rich = round(earns_over_50k[~has_advanced_education].mean() * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    # Select the minimum-hours rows once, then count them and measure the >50K share
    min_workers = df.loc[df['hours-per-week'] == min_work_hours]
    num_min_workers = min_workers.shape[0]

    rich_percentage = round((min_workers['salary'] == '>50K').mean() * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # Percentage of >50K earners per country, computed once and reused for both answers
    rich_pct_by_country = (df['salary'] == '>50K').groupby(df['native-country']).mean() * 100

    highest_earning_country = rich_pct_by_country.idxmax()

    highest_earning_country_percentage = round(rich_pct_by_country.max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df.loc[(df['native-country'] == 'India') & (df['salary'] == '>50K'), 'occupation'].value_counts().idxmax()

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

In [4]:
calculate_demographic_data()

Number of each race:
 White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
Average age of men: 39.4
Percentage with Bachelors degrees: 16.4%
Percentage with higher education that earn >50K: 46.5%
Percentage without higher education that earn >50K: 17.4%
Min work time: 1 hours/week
Percentage of rich among those who work fewest hours: 10.0%
Country with highest percentage of rich: Iran
Highest percentage of rich people in country: 41.9%
Top occupations in India: Prof-specialty


{'race_count': White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
 Name: race, dtype: int64,
 'average_age_men': 39.4,
 'percentage_bachelors': 16.4,
 'higher_education_rich': 46.5,
 'lower_education_rich': 17.4,
 'min_work_hours': 1,
 'rich_percentage': 10.0,
 'highest_earning_country': 'Iran',
 'highest_earning_country_percentage': 41.9,
 'top_IN_occupation': 'Prof-specialty'}

In [7]:
df = pd.read_csv('adult.data.csv')

In [8]:
race_count = df['race'].value_counts()
race_count

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
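
`value_counts()` returns one row per distinct value, sorted from most to least frequent. As a minimal sketch, here it is applied to the `race` values of the five sample rows shown at the top of this notebook (not the full dataset). The `normalize=True` variant, which returns shares instead of counts, is an optional extra and is not required by the exercise; the variable names are illustrative.

```python
import pandas as pd

# Race values of the five sample rows from the table at the top of the notebook
race = pd.Series(['White', 'White', 'White', 'Black', 'Black'], name='race')

# Absolute counts per race, most frequent first
counts = race.value_counts()

# Same breakdown expressed as percentages of all rows
shares = (race.value_counts(normalize=True) * 100).round(1)

print(counts['White'], counts['Black'])
print(shares['White'], shares['Black'])
```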

In [9]:
average_age_men = round(df.loc[df['sex'] == 'Male','age'].mean(),1)
average_age_men

39.4
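
As a cross-check on the filtered mean above, grouping by `sex` gives the average age of every group in a single call. This is a minimal sketch on the five sample rows shown at the top of this notebook, not on the full dataset; the names `sample` and `mean_age_by_sex` are illustrative.

```python
import pandas as pd

# The five sample rows from the table at the top (only the columns needed here)
sample = pd.DataFrame({
    'age': [39, 50, 38, 53, 28],
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female'],
})

# Mean age per sex; the 'Male' entry corresponds to the filtered mean
mean_age_by_sex = sample.groupby('sex')['age'].mean().round(1)

# Filtered mean, written the same way as in the cell above
filtered_mean = round(sample.loc[sample['sex'] == 'Male', 'age'].mean(), 1)

print(mean_age_by_sex['Male'], filtered_mean)
```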

In [10]:
percentage_bachelors = round((df.loc[df['education'] == 'Bachelors',].shape[0]) / (df['education'].shape[0]) * 100,1)
percentage_bachelors

16.4
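
A percentage like this one can also be read as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below compares both forms on the `education` values of the five sample rows shown at the top of this notebook; the variable names are illustrative.

```python
import pandas as pd

# Education values of the five sample rows from the table at the top
education = pd.Series(['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors'])

is_bachelors = education == 'Bachelors'

# Ratio of matching rows to all rows
pct_from_counts = round(is_bachelors.sum() / len(education) * 100, 1)

# Mean of the boolean mask: the same fraction, without explicit counting
pct_from_mean = round(is_bachelors.mean() * 100, 1)

print(pct_from_counts, pct_from_mean)
```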

In [11]:
higher_education = round(df.loc[(df['education'] == 'Bachelors') 
                              | (df['education'] =='Masters') 
                              | (df['education'] =='Doctorate')].shape[0] 
                              / df['education'].shape[0] * 100,1)
higher_education

23.0

In [12]:
lower_education = round(df.loc[(df['education'] != 'Bachelors') 
                              & (df['education'] != 'Masters') 
                              & (df['education'] !='Doctorate')].shape[0] 
                              / df['education'].shape[0] * 100,1)
lower_education

77.0
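
The three chained comparisons used in the last two cells can be collapsed into a single `Series.isin` call, and the "without advanced education" group is then simply its negation with `~`. A minimal sketch on the `education` values of the five sample rows from the top of this notebook (variable names are illustrative):

```python
import pandas as pd

# Education values of the five sample rows from the table at the top
education = pd.Series(['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Bachelors'])

# One membership test instead of three comparisons joined with |
advanced = education.isin(['Bachelors', 'Masters', 'Doctorate'])

# The chained form, kept only to show that both masks agree
chained = (education == 'Bachelors') | (education == 'Masters') | (education == 'Doctorate')

higher = round(advanced.mean() * 100, 1)
lower = round((~advanced).mean() * 100, 1)  # complement of the same mask

print(higher, lower)
```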

In [13]:
higher_education_rich = round(df.loc[((df['education'] == 'Bachelors') 
                              | (df['education'] =='Masters') 
                              | (df['education'] =='Doctorate')) 
                              & (df['salary'] == '>50K')].shape[0] 
                              / df.loc[(df['education'] == 'Bachelors') 
                              | (df['education'] =='Masters') 
                              | (df['education'] =='Doctorate')].shape[0] * 100,1)
higher_education_rich

46.5

In [14]:
lower_education_rich = round(df.loc[((df['education'] != 'Bachelors') 
                              & (df['education'] !='Masters') 
                              & (df['education'] !='Doctorate')) 
                              & (df['salary'] == '>50K')].shape[0] 
                              / df.loc[(df['education'] != 'Bachelors') 
                              & (df['education'] !='Masters') 
                              & (df['education'] !='Doctorate')].shape[0] * 100,1)
lower_education_rich

17.4
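
Both education-group percentages can be obtained together by grouping the ">50K" flag by the "advanced education" flag. The sketch below uses hypothetical toy rows invented purely for illustration (they are not taken from the census data), and the labels `'advanced'` and `'other'` are assumptions of this example.

```python
import pandas as pd

# Hypothetical toy rows, invented only to illustrate the pattern
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', 'Doctorate', '11th', 'HS-grad'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
rich = toy['salary'] == '>50K'

# Readable group labels instead of True/False
group = advanced.map({True: 'advanced', False: 'other'})

# Mean of the >50K flag within each education group, as a percentage
rich_pct = (rich.groupby(group).mean() * 100).round(1)

print(rich_pct['advanced'], rich_pct['other'])
```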

In [15]:
min_work_hours = round(df['hours-per-week'].min(),1)
min_work_hours

1

In [16]:
num_min_workers = round(df.loc[df['hours-per-week'] == df['hours-per-week'].min()].shape[0], 1)
num_min_workers

20

In [17]:
rich_percentage = round(df.loc[(df['hours-per-week'] == df['hours-per-week'].min()) 
                              & (df['salary'] == '>50K')].shape[0] 
                              / df.loc[df['hours-per-week'] 
                              == df['hours-per-week'].min()].shape[0] * 100,1)
rich_percentage

10.0
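
The minimum is recomputed several times in the expressions above; storing it once, along with the filtered rows, keeps the three answers consistent with each other. A minimal sketch on hypothetical toy rows (invented for illustration, not census data):

```python
import pandas as pd

# Hypothetical toy rows, invented only to illustrate the pattern
toy = pd.DataFrame({
    'hours-per-week': [40, 1, 13, 1, 1, 1],
    'salary': ['<=50K', '>50K', '<=50K', '<=50K', '<=50K', '<=50K'],
})

# Compute the minimum once and reuse it
min_hours = toy['hours-per-week'].min()

# Rows of the people who work exactly the minimum number of hours
min_workers = toy[toy['hours-per-week'] == min_hours]

num_min_workers = len(min_workers)
rich_percentage = round((min_workers['salary'] == '>50K').mean() * 100, 1)

print(min_hours, num_min_workers, rich_percentage)
```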

In [18]:
highest_earning_country = (df.loc[df['salary'] == '>50K'].groupby('native-country')['salary'].count() 
                                / df.groupby('native-country')['salary'].count() * 100).idxmax()
highest_earning_country

'Iran'

In [19]:
highest_earning_country_percentage = round((df.loc[df['salary'] == '>50K'].groupby('native-country')['salary'].count() 
                                                / df.groupby('native-country')['salary'].count() * 100).max(),1)
highest_earning_country_percentage

41.9
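
Dividing two `groupby` counts, as above, leaves `NaN` for any country that has no >50K earners, because that country is missing from the numerator. Grouping a boolean ">50K" flag by country and taking the mean yields 0 for such countries instead, while `idxmax()` and `max()` give the same answer whenever at least one country has a >50K earner. The sketch below uses hypothetical toy rows with placeholder country labels `A`, `B`, and `C`.

```python
import pandas as pd

# Hypothetical toy rows; 'A', 'B', 'C' are placeholder country labels
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K'],
})

is_rich = toy['salary'] == '>50K'

# Ratio of two groupby counts: country 'C' has no >50K rows, so it becomes NaN
ratio = (toy[is_rich].groupby('native-country')['salary'].count()
         / toy.groupby('native-country')['salary'].count() * 100)

# Mean of the boolean flag per country: country 'C' becomes 0.0
rich_pct = is_rich.groupby(toy['native-country']).mean() * 100

top_country = rich_pct.idxmax()
top_percentage = round(rich_pct.max(), 1)

print(top_country, top_percentage)
```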

In [21]:
top_IN_occupation = df.loc[(df['native-country'] == 'India') & (df['salary'] == '>50K')]['occupation'].value_counts().idxmax()
top_IN_occupation

'Prof-specialty'
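
Calling `idxmax()` on an empty selection raises an error, which would happen if the filter matched no rows (for example, a country with no >50K earners). The sketch below wraps the same lookup in a hypothetical helper, `top_occupation`, that returns `None` in that case; the helper name and the toy rows are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy rows, invented only to illustrate the pattern
toy = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'Cuba'],
    'salary': ['>50K', '>50K', '>50K', '<=50K'],
    'occupation': ['Prof-specialty', 'Prof-specialty', 'Exec-managerial', 'Sales'],
})


def top_occupation(frame, country):
    """Most frequent occupation among >50K earners in `country`, or None if there are none."""
    mask = (frame['native-country'] == country) & (frame['salary'] == '>50K')
    # Select rows and the column in one .loc call, then count occupations
    counts = frame.loc[mask, 'occupation'].value_counts()
    return counts.idxmax() if not counts.empty else None


print(top_occupation(toy, 'India'))
print(top_occupation(toy, 'Cuba'))
```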