![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_blue.png?raw=true)

# 1.0 - Demographic Data Analyser

![Bar](https://raw.githubusercontent.com/AdmiJW/Items/master/SeperatingBars/Horizontalbar_blue.png)




### 1.1 - Traditional way: Fetching data and store in Python lists

The non-pandas way of fetching the CSV file is lengthier, because we have to do the parsing ourselves. Here I implemented a simple CSV file parser.

Parameters:
* `url` - The URL of the CSV file. It must be an HTTP(S) URL, since the file is fetched with `requests.get()`, which does not read local file paths
* `delimiter` - **(Optional)** The delimiter used in the CSV file. Default value is the comma `,`

Returns a List containing List items, each holding the string values parsed from one line.

In [3]:
import requests
from typing import List, Union

def fetchCsv( url: str, delimiter: str = ',' ) -> List:
    #   Invalid delimiter of csv provided. Just raise an exception
    if ( len(delimiter) != 1 ):
        raise ValueError("Delimiter must be a single character!")

    #   Fetch the csv file at the provided url
    response = requests.get( url )

    #   Unable to fetch the csv file in provided url. Raise an exception for that
    if (response.status_code != 200 ):
        raise ConnectionError("Error: Unable to fetch the csv file at {}\n\
                status code: {}".format(url, response.status_code) )

    #   Obtain the text, split them into lines (Based on \r\n )
    responseText = response.text
    responseList = responseText.split('\r\n')

    #   Map each line into a list of values split by the delimiter specified.
    responseList = list( map( lambda s: s.split( delimiter ), responseList) )

    return responseList



Here is a quick test to check that the function works correctly:

In [6]:
li = fetchCsv('https://raw.githubusercontent.com/AdmiJW/Items/master/adult.data.csv')

for i in range(5):
    print( li[i] )

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'salary']
['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40', 'United-States', '<=50K']
['50', 'Self-emp-not-inc', '83311', 'Bachelors', '13', 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', '0', '0', '13', 'United-States', '<=50K']
['38', 'Private', '215646', 'HS-grad', '9', 'Divorced', 'Handlers-cleaners', 'Not-in-family', 'White', 'Male', '0', '0', '40', 'United-States', '<=50K']
['53', 'Private', '234721', '11th', '7', 'Married-civ-spouse', 'Handlers-cleaners', 'Husband', 'Black', 'Male', '0', '0', '40', 'United-States', '<=50K']
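Splitting on `\r\n` and on the delimiter works for this file, but it would break on quoted fields that contain the delimiter, and it keeps trailing blank lines as rows. As a rough alternative, the sketch below parses already-downloaded text (for example `response.text`) with the standard-library `csv` module. The helper name `parse_csv_text` and the sample text are assumptions made for illustration, not part of the original notebook.

```python
import csv
import io
from typing import List


def parse_csv_text(text: str, delimiter: str = ',') -> List[List[str]]:
    """Parse CSV text into a list of rows, skipping completely blank lines."""
    # newline='' leaves line endings untouched so csv.reader can handle them
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=delimiter)
    # csv.reader yields an empty list for a blank line, so filter those out
    return [row for row in reader if row]


# Hypothetical sample: one quoted field containing a comma, one trailing blank line
sample = 'age,workclass\r\n39,State-gov\r\n50,"Self-emp, not inc"\r\n\r\n'
rows = parse_csv_text(sample)
print(rows)
```

The quoted field stays in one piece and the trailing blank line does not become a row.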


![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_blue.png?raw=true)

### 1.2 - The pandas Way to Fetch CSV Files

`pandas` comes with a handful of functions to read data from various sources, one of which is `pandas.read_csv()`. It returns a `pandas.DataFrame` object that is mostly ready to be processed.

Here I've written a toned-down wrapper around `pandas`'s `read_csv()` function.

Parameters:
* `url` - The URL of the CSV file. Either an HTTP URL or a local file path
* `delimiter` - **(Optional)** The delimiter used to parse the CSV file. Default is the comma `,`
* `header` - **(Optional)** Takes a boolean value. If True, the first row of the CSV is treated as the header row and becomes the column names of the DataFrame. Otherwise no header is parsed


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def pandasFetchCSV(url:str, delimiter:str = ',', header:bool = False):
    df:Union[pd.io.parsers.TextFileReader, pd.Series, pd.DataFrame, None] = pd.read_csv(
        url,
        sep=delimiter,
        header=0 if header else None)

    return df



We can quickly test the above function here:

In [12]:
df = pandasFetchCSV('https://raw.githubusercontent.com/AdmiJW/Items/master/adult.data.csv', header=True )

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
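To make the effect of the `header` flag concrete, here is a small sketch that reads the same CSV text twice, once with `header=0` and once with `header=None`. The in-memory text is a made-up two-row sample, used so that the example runs without a network connection.

```python
import io

import pandas as pd

# Made-up sample text standing in for the downloaded file
csv_text = "age,workclass\n39,State-gov\n50,Private\n"

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), sep=',', header=0)

# header=None: every line is treated as data, columns are numbered 0, 1, ...
without_header = pd.read_csv(io.StringIO(csv_text), sep=',', header=None)

print(list(with_header.columns), len(with_header))
print(list(without_header.columns), len(without_header))
```

With a header row, `read_csv` can also infer a numeric dtype for `age`, which is one reason the pandas route needs less cleaning afterwards.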


![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

# 2.0 - Actual Fetching and Cleaning up of Data

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

### 2.1 - Fetching of the data

Here I'll use the more traditional way of fetching the data: the `requests`-based `fetchCsv` function above.

Since it returns a `List` containing `List` items of values, I need to pass it to `pandas` to create a `DataFrame` out of it.

In [5]:
listItems = fetchCsv('https://raw.githubusercontent.com/AdmiJW/Items/master/adult.data.csv')

# Use the DataFrame constructor, passing in the List
df = pd.DataFrame(listItems)

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

### 2.2 - Setting Name and Column Name (Header)

Now the data is fetched and stored in a `DataFrame` object. Let's start by giving the DataFrame a name and a header.



In [6]:

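# Set the name of the DataFrame.
# Note: this is a plain Python attribute rather than a pandas feature, so it is
# not carried over to new DataFrames derived from this one (e.g. by slicing)
df.name = 'Demographic Analysis Data'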


In [7]:

# Set the column names of the DataFrame, which are stored in the first row
df.columns = df.iloc[0]
# Since the first row now serves as the column names, drop it from the data
df.drop( 0, inplace = True )
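The same result can be reached in one step by handing the header row to the `DataFrame` constructor. This is only a sketch on toy rows shaped like the output of `fetchCsv`; the names `rows` and `df_small` are illustrative.

```python
import pandas as pd

# Toy rows shaped like the output of fetchCsv: header first, then data
rows = [
    ['age', 'workclass'],
    ['39', 'State-gov'],
    ['50', 'Private'],
]

# Use the first row as the column names and the remaining rows as data
df_small = pd.DataFrame(rows[1:], columns=rows[0])

print(df_small)
```

One difference to keep in mind: built this way, the index starts at 0, whereas dropping row 0 afterwards leaves an index that starts at 1.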


![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

### 2.3 - Checking Info of the DataFrame

The data may have some basic problems. Let's check the following:

* Column Datatypes
* Head
* Tail

In [8]:

# Let's obtain some info about our DataFrame now

df.info()

# It seems all of the columns are of dtype object, which wastes memory and is hard to analyse!
# We shall deal with them one by one!

print('\n--------------------------------------------------\n')


# But what 'object' is it actually? Python's built-in type() function can tell us.

# By iterating through the column names, print the type of the values. We will use the first row as a sample
# I don't use Series.dtype because it would just return 'object'
for col in df.columns:
    print(col, 'is of type --------', type(df[col][1] ) )


<class 'pandas.core.frame.DataFrame'>
Int64Index: 32563 entries, 1 to 32563
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32563 non-null  object
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  object
 3   education       32561 non-null  object
 4   education-num   32561 non-null  object
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  object
 11  capital-loss    32561 non-null  object
 12  hours-per-week  32561 non-null  object
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: object(15)
memory usage: 2.1+ MB

--------------------------------------------------

age is of type -------- <class 'str'>
workclass i

In [9]:
# Let's see the head of the DataFrame

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
1,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
3,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
4,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
5,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
# And also the tail of the DataFrame

df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
32559,58.0,Private,151910.0,HS-grad,9.0,Widowed,Adm-clerical,Unmarried,White,Female,0.0,0.0,40.0,United-States,<=50K
32560,22.0,Private,201490.0,HS-grad,9.0,Never-married,Adm-clerical,Own-child,White,Male,0.0,0.0,20.0,United-States,<=50K
32561,52.0,Self-emp-inc,287927.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024.0,0.0,40.0,United-States,>50K
32562,,,,,,,,,,,,,,,
32563,,,,,,,,,,,,,,,


![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

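### 2.3.1 - Empty rows at the end

We saw from `df.tail()` that the last two rows of the DataFrame are empty. The most likely cause is that the original CSV file ends with trailing line breaks, and splitting the text on `\r\n` in `fetchCsv` turns each of them into a row holding a single empty string. This would also explain why `df.info()` reports 32563 non-null values for `age` but only 32561 for every other column. Therefore, we have to clean them up.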

In [11]:
# Remove the last 2 rows, which are empty
df = df.iloc[:-2]
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32560,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32561,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K
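Slicing off the last two rows works because we looked at the tail first. A more general sketch, assuming blank rows show up as an empty string in the first column and `None` elsewhere (as the tail above suggests), drops every row in which all cells are blank, wherever it sits. The toy frame `df_small` is illustrative only.

```python
import pandas as pd

# Toy frame mimicking the situation above: two real rows, two blank rows
rows = [['39', 'State-gov'], ['50', 'Private'], ['', None], ['', None]]
df_small = pd.DataFrame(rows, columns=['age', 'workclass'])

# A cell counts as blank if it is missing (None/NaN) or an empty string
is_blank = df_small.isna() | (df_small == '')

# Keep only the rows that have at least one non-blank cell
cleaned = df_small[~is_blank.all(axis=1)]

print(cleaned)
```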


![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_red.png?raw=true)

### 2.3.2 - Incorrect Datatypes

Let's set those columns to have the correct datatypes

---

Here I've written a homemade function that takes a Series as its argument. The Series contains strings that are supposed to represent integer values.
The function tries to parse each value into an `int` and detects any value that cannot be converted.
In addition, it reports the minimum and maximum values, or the indices of the invalid values if any exist in the Series.

Parameters:
* `series` - The `pandas.Series` object
* `can_negative` - **(Optional)** Boolean value. Can there be negative values in the Series? If False, any negative value is treated as invalid

Returns:
* A List containing
   * Index 0: Boolean value telling whether every value in the Series can be converted to an integer under the given condition
   * Index 1: A List whose contents depend on index 0
      * If index 0 is True (all values are valid integers) --- [ min, max ]
      * If index 0 is False (some values are invalid) --- the index labels of the invalid values

In [42]:


# This is a custom function which takes in a Series of strings that are supposed to be integers.
# It checks whether ALL values are valid integers, so the column can be safely converted to the correct datatype


def checkStringIntValues(series: pd.Series, can_negative:bool = False ) -> List:
    
    result = [ True, list() ]
    vmax = -float('inf')
    vmin = float('inf')

    for i in series.index:
        try:
            num = int( series[i] )
            
            if not can_negative and num < 0:
                raise ValueError("Negative values")
            
            vmax = max(vmax, num)
            vmin = min(vmin, num)
        except (ValueError, TypeError):
            result[0] = False
            result[1].append(i)
            
    if len(result[1] ) == 0:
        result[1].append(vmin)
        result[1].append(vmax)
        
    return result
            
    
    



In [43]:
###########################################
# age column -> int8 (since ages never get very high)
###########################################

# Using the homemade function, check the validity of the age column
[valid, minmax] = checkStringIntValues( df.age )

print( valid )       # True
print( minmax )      # [17, 90]


# Since the highest value is only 90, we can safely use int8 as the datatype
df['age'] = df['age'].astype( np.int8 )



True
[17, 90]
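As a possible alternative to the hand-written loop, `pandas.to_numeric` with `errors='coerce'` can flag unparseable values in one vectorised call. The sketch below runs on a made-up Series. Note one difference from `int()`: `to_numeric` also accepts decimal strings such as `'3.5'`, so it is not a drop-in replacement for a strict integer check.

```python
import pandas as pd

# Made-up values: two valid integers, one non-number, one negative number
s = pd.Series(['39', '50', 'abc', '-4'], index=[1, 2, 3, 4])

# errors='coerce' turns anything unparseable into NaN instead of raising
nums = pd.to_numeric(s, errors='coerce')

# Index labels of values that failed to parse
invalid_idx = nums[nums.isna()].index.tolist()

# Index labels of negative values (what can_negative=False would reject)
negative_idx = nums[nums < 0].index.tolist()

print(invalid_idx, negative_idx)
print(nums.min(), nums.max())
```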


In [44]:
###########################################
# workclass column. Let's check values first
###########################################

print( df.workclass.unique() )

# The datatype is string, as expected. However, there is a '?' value, meaning an unknown work class.
# Let's just ignore this value (it is not counted in the data analysis)

['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked']
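If the `'?'` placeholders ever need to be excluded explicitly, one option is to turn them into missing values, which most pandas aggregations skip by default. This is a sketch on a made-up Series, not something the notebook itself does.

```python
import pandas as pd

# Made-up sample containing the '?' placeholder
workclass = pd.Series(['Private', '?', 'State-gov', 'Private', '?'])

# Series.mask replaces the values where the condition is True with NaN
known = workclass.mask(workclass == '?')

# value_counts() leaves NaN out by default, so '?' no longer appears
counts = known.value_counts()

print(counts)
```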


In [45]:

###########################################
# fnlwgt column. Let's see if there's any invalid numeric values
###########################################

[valid, minmax] = checkStringIntValues( df.fnlwgt )

print(valid)        # True
print(minmax)       # [12285, 1484705]


# The values go as high as 1484705, so int32 is suitable

df['fnlwgt'] = df['fnlwgt'].astype( np.int32 )

True
[12285, 1484705]
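The choice between `int8`, `int16` and `int32` follows directly from the `[min, max]` pair. A small sketch that automates the decision with `numpy.iinfo` is shown below; `smallest_int_dtype` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np


def smallest_int_dtype(vmin: int, vmax: int):
    """Return the narrowest signed NumPy integer type that can hold [vmin, vmax]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= vmin and vmax <= info.max:
            return candidate
    raise OverflowError("Range does not fit in a 64-bit signed integer")


# The ranges reported for the age and fnlwgt columns above
print(smallest_int_dtype(17, 90).__name__)
print(smallest_int_dtype(12285, 1484705).__name__)
```

For the two ranges above this picks `int8` and `int32`, matching the types chosen by hand.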


In [46]:

###########################################
# education column. Let's see unique values
###########################################
df['education'].unique()

# No invalid values. All good. Moving on

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [47]:

############################################
# education-num column. Let's see
###########################################

[valid, minmax] = checkStringIntValues( df['education-num'] )

print( valid )           # True
print( minmax )          # [1,16]


# Only as high as 16. Use int8
df['education-num'] = df['education-num'].astype( np.int8 )

True
[1, 16]


In [48]:

############################################
# marital-status column. Let's see
###########################################

df['marital-status'].unique()

# All is good. Continue on

array(['Never-married', 'Married-civ-spouse', 'Divorced',
       'Married-spouse-absent', 'Separated', 'Married-AF-spouse',
       'Widowed'], dtype=object)

In [49]:

############################################
# occupation column. Let's see
###########################################

df.occupation.unique()

# Take note of the '?' unknown value. Ignore that and continue on

array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',
       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
       'Tech-support', '?', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)

In [50]:

############################################
# relationship column. Let's see
###########################################

df.relationship.unique()

# Alright

array(['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried',
       'Other-relative'], dtype=object)

In [51]:

############################################
# race column. Let's see
###########################################

df.race.unique()

# OK

array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)

In [52]:

############################################
# sex column. Let's see
###########################################
df.sex.unique()

# No problem here


array(['Male', 'Female'], dtype=object)

In [53]:

############################################
# capital-gain column. Let's see
###########################################

[valid, minmax] = checkStringIntValues( df['capital-gain'] )

print(valid)       # True
print(minmax)      # [0, 99999]

# An int32 will suffice

df['capital-gain'] = df['capital-gain'].astype( np.int32 )

True
[0, 99999]


In [54]:
############################################
# capital-loss column. Let's see
###########################################

[valid, minmax] = checkStringIntValues( df['capital-loss'] )

print(valid)         # True
print(minmax)        # [0, 4356]

# In this case maximum is only 4356, well within range of int16

df['capital-loss'] = df['capital-loss'].astype( np.int16 )

True
[0, 4356]


In [58]:
############################################
# hours-per-week column. Let's see
###########################################

[valid, minmax] = checkStringIntValues( df['hours-per-week'] )

print(valid)          # True
print(minmax)         # [1,99]

# int8 is already more than enough

df['hours-per-week'] = df['hours-per-week'].astype( np.int8 )

True
[1, 99]


In [61]:
############################################
# native-country column. Let's see
###########################################

df['native-country'].unique()

df['native-country'].value_counts()

# Just take note that unknown values ('?') exist in the data: 583 of them

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [64]:
############################################
# salary column. Let's see
###########################################

# Since salary is not in the form of pure numbers, we check unique values

df.salary.unique()

# Just <=50K and >50K? OK


array(['<=50K', '>50K'], dtype=object)

![Bar](https://raw.githubusercontent.com/AdmiJW/Items/master/SeperatingBars/Horizontalbar_red.png)

It seems all the datatypes are now correct. As for the unknown (`?`) values, we can safely ignore them for now.
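To get a feel for what the conversions buy in memory, `Series.memory_usage(deep=True)` can compare a column stored as text with the same column stored as `int8`. The sketch uses a made-up column of repeated ages rather than the real data, so the absolute numbers are not those of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up column: five age strings repeated 1000 times
ages_as_text = pd.Series(['39', '50', '38', '53', '28'] * 1000)
ages_as_int8 = ages_as_text.astype(np.int8)

# deep=True also counts the memory held by the Python string objects
text_bytes = ages_as_text.memory_usage(deep=True)
int8_bytes = ages_as_int8.memory_usage(deep=True)

print(text_bytes, int8_bytes)
```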

Let's continue into the **ACTUAL** data analysis now. Action time!

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

# 3.0 - Data Analysis

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.1 - Problem #1

Q: _How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)_

In [113]:

# This problem asks for a frequency table. This is how we actually check for
# number of unique values! Use Series.value_counts() to get the result!

Q1_ans = df['race'].value_counts()

Q1_ans


White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
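If shares are wanted instead of raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up sample of race labels
race = pd.Series(['White', 'Black', 'White', 'Other', 'White'])

# Absolute frequencies, most common value first
counts = race.value_counts()

# Relative frequencies, converted to percentages
shares = race.value_counts(normalize=True) * 100

print(counts)
print(shares)
```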

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.2 - Problem #2

Q: _What is the average age of men?_

In [183]:

# This problem asks for the average age of men.
# pandas offers Series.mean() for this; here the average is computed by hand
# instead, as sum() divided by the number of entries

male_mask = df.sex == 'Male'

Q2_ans = df[ male_mask ].age.sum() / len( df[ male_mask] )

Q2_ans



39.43354749885268
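For reference, pandas does provide `Series.mean()`, which gives the same result as dividing the sum by the number of entries. A quick sketch on a made-up frame (`people` is illustrative):

```python
import pandas as pd

# Made-up sample with the two columns the question needs
people = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'age': [39, 28, 50, 38],
})

male_mask = people['sex'] == 'Male'

# Manual average: sum divided by the number of matching rows
manual = people[male_mask]['age'].sum() / len(people[male_mask])

# Built-in average over the same rows
builtin = people.loc[male_mask, 'age'].mean()

print(manual, builtin)
```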

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.3 - Problem #3

Q: _What is the percentage of people who have a Bachelor's degree?_

In [115]:

# This problem asks for number of people with 'Bachelors' value in its education
# column, over all people, in percentage

sample = len( df[ df.education == 'Bachelors'] )
Q3_ans = sample / len(df) * 100

Q3_ans

16.44605509658794
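A compact way to express the same percentage is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. A sketch on a made-up Series:

```python
import pandas as pd

# Made-up sample of education values
education = pd.Series(['Bachelors', 'HS-grad', '11th', 'Bachelors'])

# Count-based version, as in the cell above
by_len = len(education[education == 'Bachelors']) / len(education) * 100

# Mean of the boolean mask: the fraction of True values
by_mean = (education == 'Bachelors').mean() * 100

print(by_len, by_mean)
```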

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.4 - Problem #4

Q: _What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?_

In [116]:

# This is a compound query. We have to find
# those people who have advanced education, and
# those among them who make more than 50K

# Basically, it is ( advanced edu & >50K ) / ( advanced edu )

# To get boolean mask whether values is in a specified values list, use
# isin() function

advanced_edu_values = ['Bachelors', 'Masters', 'Doctorate']
advanced_edu_mask = df.education.isin( advanced_edu_values )

over_50K_mask = df.salary == '>50K'


Q4_ans = len( df[ (advanced_edu_mask) & (over_50K_mask) ] ) \
         / len( df[advanced_edu_mask]) * 100

Q4_ans

46.535843011613935

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.5 - Problem #5

Q: _What percentage of people without advanced education make more than 50K?_

In [123]:

# This problem requires
# >    People with no advanced education. Just negate the last mask
# >    People making more than 50K. Just use the last mask

advanced_edu_values = ['Bachelors', 'Masters', 'Doctorate']
advanced_edu_mask = df.education.isin( advanced_edu_values )
no_advanced_edu_mask = ~advanced_edu_mask

Q5_ans = len( df[ (no_advanced_edu_mask) & (over_50K_mask) ] ) \
        /len( df[ (no_advanced_edu_mask) ] ) * 100

Q5_ans



17.3713601914639
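Problems #4 and #5 ask the same question for two complementary groups, so both answers can also come out of a single `groupby`. The sketch below uses a made-up frame; `people`, `group` and `pct_rich` are illustrative names.

```python
import pandas as pd

# Made-up sample: three people with advanced education, three without
people = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', '11th', 'Doctorate', 'HS-grad'],
    'salary':    ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
})

advanced = people['education'].isin(['Bachelors', 'Masters', 'Doctorate'])

# Label each person by group so the result has readable index labels
group = advanced.map({True: 'advanced', False: 'other'})

# True where the person earns more than 50K
rich = people['salary'] == '>50K'

# Mean of a boolean Series per group = share of rich people in that group
pct_rich = rich.groupby(group).mean() * 100

print(pct_rich)
```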

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.6 - Problem #6

Q: _What is the minimum number of hours a person works per week?_

In [122]:

# Just get the minimum of the hours-per-week column

# Summary statistics
print ( df['hours-per-week'].describe()  )

Q6_ans = df['hours-per-week'].min()

Q6_ans

count    32561.000000
mean        40.437456
std         12.347429
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64


1

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.7 - Problem #7

Q: _What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?_

In [126]:

# We will reuse the result of the last query here.
# We need the masks of:
# >    People working the minimum hours per week
# >    People earning a salary of more than 50K

min_hour = df['hours-per-week'].min()
min_mask = df['hours-per-week'] == min_hour

salary_mask = df.salary == '>50K'

Q7_ans = len( df[ min_mask & salary_mask ] ) / len( df[ min_mask] ) * 100

Q7_ans

10.0

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.8 - Problem #8

Q: _What country has the highest percentage of people that earn >50K and what is the percentage?_

In [174]:

# This is a more complex query. We will approach it in a series of steps
# 1.    We need a frequency table for people in each country
# 2.    We need a frequency table for people in each country earning
#       >50K.
# 3.    Divide the corresponding counts. We will get the percentage from
#       it

Nonefilter = df['native-country'] != '?'

pop_country = df[Nonefilter]['native-country'].value_counts()

salary_mask = df[Nonefilter].salary == '>50K'

pop_country_rich = df[Nonefilter][salary_mask]['native-country'].value_counts()

print( pop_country) 
print( pop_country_rich)

country_percentage = pop_country_rich / pop_country * 100
country_percentage.sort_values( ascending=False, inplace=True)

Q8_1_ans = country_percentage.index[0]
Q8_2_ans = country_percentage.max()

print( Q8_1_ans, Q8_2_ans )


United-States                 29170
Mexico                          643
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                           29
France                      

41.86046511627907
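The three steps can also be folded into one `groupby`: group a boolean ">50K" Series by country and take the mean. The sketch below uses a made-up frame. One difference from dividing two frequency tables is that a country with no high earners comes out as 0 here rather than as a missing value.

```python
import pandas as pd

# Made-up sample, including one unknown country and one country with no >50K earner
people = pd.DataFrame({
    'native-country': ['India', 'India', 'Cuba', 'Cuba', 'Cuba', '?', 'Peru'],
    'salary':         ['>50K', '<=50K', '>50K', '>50K', '<=50K', '>50K', '<=50K'],
})

# Leave out rows whose country is unknown
known = people[people['native-country'] != '?']

# True where the person earns more than 50K
rich = known['salary'] == '>50K'

# Share of rich people per country, in percent
pct_by_country = rich.groupby(known['native-country']).mean() * 100

# Label and value of the highest percentage
top_country = pct_by_country.idxmax()
top_pct = pct_by_country.max()

print(top_country, top_pct)
```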

![Bar](https://github.com/AdmiJW/Items/blob/master/SeperatingBars/Horizontalbar_green.png?raw=true)

### 3.9 - Problem #9

Q: _Identify the most popular occupation for those who earn >50K in India_

In [166]:

# We need:
# >    Salary mask for those >50K
# >    Country mask for India
# >    Count values for occupation. Get the top value

salary_mask = df.salary == '>50K'

country_mask = df['native-country'] == 'India'

occupations_series = df[ salary_mask & country_mask ].occupation

print( occupations_series.value_counts() )

Q9_ans = occupations_series.value_counts().index[0]

Q9_ans

Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Sales                1
Adm-clerical         1
Transport-moving     1
Name: occupation, dtype: int64


'Prof-specialty'
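Instead of relying on the sort order of `value_counts()` and taking `index[0]`, the top label can be requested directly with `idxmax()`. A sketch on a made-up frame (`people` and `selected` are illustrative names):

```python
import pandas as pd

# Made-up sample with rows inside and outside the filter
people = pd.DataFrame({
    'native-country': ['India', 'India', 'India', 'India', 'Cuba'],
    'salary':         ['>50K', '>50K', '>50K', '<=50K', '>50K'],
    'occupation':     ['Prof-specialty', 'Prof-specialty', 'Sales', 'Sales', 'Sales'],
})

# Occupations of people from India who earn more than 50K
mask = (people['salary'] == '>50K') & (people['native-country'] == 'India')
selected = people.loc[mask, 'occupation']

# idxmax() returns the label with the highest count
top_occupation = selected.value_counts().idxmax()

print(top_occupation)
```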