# Data Analysis with Pandas

In this notebook, I use Pandas to read the Adult dataset and perform some basic analysis to better understand it. The notebook walks through the data analysis process step by step.

### Goals:  
- Program in Python using Jupyter notebook
- Perform Data Analysis using Pandas
- Practice data pre-processing methods
- Analyze and summarize the dataset by finding facts from the data

#### Dataset:  
Adult - https://archive.ics.uci.edu/ml/datasets/Adult
- Please read the Adult webpage carefully, including the Attribute Information section, to familiarise yourself with the data and its structure.
- To download the data, click 'Data Folder' and select 'adult.data'. Save the data file as a .csv file.
- Attributes: You will also need to see 'adult.names' for the attribute names. Insert a row at the top of the dataset and add attribute names to the respective columns.
- You will notice that the last column has no name. Name the last column as 'class-label'.
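As an alternative to editing the file by hand, the attribute names can be supplied when the file is read. The sketch below is only an illustration under assumptions: the column spellings follow 'adult.names' as I understand them (the notebook's own file uses slightly different spellings), and two rows are held in memory so the example runs without the downloaded file.

```python
import io
import pandas as pd

# Attribute names as listed in 'adult.names' (assumed spellings),
# plus 'class-label' for the unnamed last column.
column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "class-label",
]

# Two rows in the raw 'adult.data' layout (no header row, ", " separators),
# kept in memory so the sketch does not depend on a local file.
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None says the file has no header row; names= supplies our own.
# skipinitialspace=True drops the blank that follows each comma.
adult = pd.read_csv(raw, header=None, names=column_names, skipinitialspace=True)
print(adult.columns.tolist())
```

With the real file, the in-memory buffer would be replaced by the path to 'adult.data'.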


## Exploratory analysis: Loading and exploring the dataset

In this section we explore the dataset to gain useful insights. The first step is to read the dataset into a dataframe and perform basic analysis to understand the data we are working with: for example, finding the maximum and minimum values, checking whether the dataset contains NULL values, and summarizing the data using 'groupby'.
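The three checks mentioned above can be sketched on a small stand-in dataframe. The values are copied from the first five rows shown later in this notebook; the variable names are illustrative and not part of the original analysis.

```python
import pandas as pd

# Five-row stand-in for the Adult data (values from the notebook's output).
sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "sex": ["Male", "Male", "Male", "Male", "Female"],
    "hours per week": [40, 13, 40, 40, 40],
})

# Minimum and maximum of each numeric column.
numeric_range = sample[["age", "hours per week"]].agg(["min", "max"])

# Number of missing (NaN) values in each column.
missing_per_column = sample.isnull().sum()

# Mean weekly hours for each value of 'sex'.
mean_hours_by_sex = sample.groupby("sex")["hours per week"].mean()

print(numeric_range)
print(missing_per_column)
print(mean_hours_by_sex)
```

One caveat: in the raw 'adult.data' file, unknown values are written as '?', so isnull() on its own may report no missing values unless those markers are converted first.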

In [1]:
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
#loading the dataset to a dataframe
data = pd.read_csv('C:/Users/25471/Desktop/DataMining_projects/datamining/adult.csv')

In [4]:
data.head()

Unnamed: 0,age,workclass,fblwgt,education,education-num,marital status,occupation,relationship,race,sex,capital gain,capital loss,hours per week,native country,class label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Here data.head() is used to display and examine the first five rows of the dataframe data.
The output shows demographic information about individuals.


- age: The age of each individual. It ranges from 28 to 53 in the given sample.
- workclass: The type of work class or employment status, such as "State-gov," "Self-emp-not-inc," and "Private."
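- fblwgt: Most likely a misspelling of 'fnlwgt' ("final weight") from 'adult.names': the number of people in the population that the census estimates each record represents.
- education and education-num: These columns describe the individual's educational background. education holds values like "Bachelors," "HS-grad," and "11th." education-num encodes the same education level as a number (in the sample, 13 for "Bachelors," 9 for "HS-grad," and 7 for "11th").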
- marital status: Describes the marital status of individuals, including categories like "Never-married," "Married-civ-spouse," and "Divorced."
- occupation: Indicates the occupation of each individual, such as "Adm-clerical," "Exec-managerial," and "Handlers-cleaners."
- relationship: Defines the relationship status of the individuals, such as "Not-in-family," "Husband," and "Wife."
- race: Specifies the racial background of the individuals, including categories like "White," "Black," and potentially others.
- sex: Indicates the gender of each individual, either "Male" or "Female."
- capital gain and capital loss: These columns provide information about capital gains and losses. In the given sample, capital loss is zero in every row, and capital gain is zero in every row except the first (2174).
- hours per week: Represents the number of hours worked per week, which is 40 in four of the five sample rows and 13 in the remaining one.
- native country: Specifies the native country of the individuals, such as "United-States" and "Cuba."
- class label: Indicates the class label or income level, with the values "<=50K" suggesting an income less than or equal to 50,000 units.
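The descriptions above are based on five rows only, so categorical columns may contain values not seen in the sample. A minimal sketch of how the distinct categories and their frequencies could be listed, using a small stand-in dataframe with values from the sample output (the variable names are illustrative):

```python
import pandas as pd

# Stand-in for two categorical columns of the Adult data.
sample = pd.DataFrame({
    "race": ["White", "White", "White", "Black", "Black"],
    "native country": ["United-States", "United-States", "United-States",
                       "United-States", "Cuba"],
})

# Distinct values of a column, in order of first appearance.
race_categories = sample["race"].unique()

# How many rows carry each value, most frequent first.
country_counts = sample["native country"].value_counts()

print(race_categories)
print(country_counts)
```

Running the same two calls on the full dataframe would show every category present, not just those in the first five rows.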

#### Q1. Use head(2), head(10), tail(2). Explain your observations

In [5]:
data.head(2)

Unnamed: 0,age,workclass,fblwgt,education,education-num,marital status,occupation,relationship,race,sex,capital gain,capital loss,hours per week,native country,class label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [6]:
data.head(10)

Unnamed: 0,age,workclass,fblwgt,education,education-num,marital status,occupation,relationship,race,sex,capital gain,capital loss,hours per week,native country,class label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [7]:
data.tail(2)

Unnamed: 0,age,workclass,fblwgt,education,education-num,marital status,occupation,relationship,race,sex,capital gain,capital loss,hours per week,native country,class label
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


- Here head(2) displays the first two rows of the dataframe data.
- head(10) returns the first 10 rows of the dataframe.
- tail(2) displays the last two rows of the dataframe.
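A short sketch of what these calls return, using a toy dataframe rather than the Adult data. It shows that head(n) and tail(n) are positional slices that keep the original index labels, and that head() without an argument defaults to five rows.

```python
import pandas as pd

# Toy dataframe with six rows and the default 0..5 index.
frame = pd.DataFrame({"age": [39, 50, 38, 53, 28, 37]})

first_two = frame.head(2)    # rows at positions 0 and 1
last_two = frame.tail(2)     # rows at positions 4 and 5
default_head = frame.head()  # n defaults to 5

# head(n) and tail(n) match positional slicing with iloc.
print(first_two.equals(frame.iloc[:2]))
print(last_two.equals(frame.iloc[-2:]))
print(last_two.index.tolist())
```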

In [8]:
data.shape

(32561, 15)

data.shape returns the dimensions of the dataframe data as a tuple: the number of rows (32561) and the number of columns (15).

## Generating a unique dataset 
For this task we are going to generate our own version of the dataset. To achieve this, we will pass 48 as random_state.

In [11]:
data = data.sample(n=30000, random_state = 48)

The sample() method randomly selects rows from a DataFrame. Here it draws a random sample of 30,000 rows from the dataset, without replacement by default.
Passing a fixed random_state (48 in this example) seeds the random number generator, so rerunning the code with the same seed selects the same rows. This makes the results reproducible.
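The effect of the seed can be checked on a toy dataframe. This is a minimal sketch, not part of the original notebook; the sizes and the second seed (7) are arbitrary choices for illustration.

```python
import pandas as pd

# Toy dataframe standing in for the full dataset.
frame = pd.DataFrame({"value": range(100)})

# Two draws with the same seed select exactly the same rows.
sample_a = frame.sample(n=10, random_state=48)
sample_b = frame.sample(n=10, random_state=48)

# A different seed will generally select a different set of rows.
sample_c = frame.sample(n=10, random_state=7)

print(sample_a.equals(sample_b))
print(sample_a.index.tolist())
print(sample_c.index.tolist())
```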

In [12]:
data.shape

(30000, 15)

The number of rows has dropped from 32561 to 30000, while the number of columns is unchanged at 15.
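One detail worth knowing is that sample() keeps the original index labels, so the sampled dataframe's index is shuffled and has gaps. The sketch below shows this on a toy dataframe and an optional renumbering step with reset_index; the renumbering is an assumption about what might be convenient here, not something the original notebook does.

```python
import pandas as pd

# Toy dataframe standing in for the full dataset.
frame = pd.DataFrame({"value": range(100)})
subset = frame.sample(n=30, random_state=48)

# Only the row count changes; the column count stays the same.
print(frame.shape, subset.shape)

# The index labels are the ones the rows had in the original dataframe.
print(subset.index[:5].tolist())

# Optional: renumber the rows 0..n-1 and discard the old labels.
renumbered = subset.reset_index(drop=True)
print(renumbered.index[:5].tolist())
```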