**DATA CLEANING IN PYTHON**

There are three questions we should ask ourselves before we do any formal analysis:

1) Are we confident that we know the full range of values, and the shape of the distribution, of every variable of interest?

2) Do we have a good idea of the bivariate relationships between variables, that is, how each one moves with the others?

3) How successful are our attempts to fix potential problems, such as outliers and missing values?

**Subsetting data**

In [6]:
import pandas as pd
import numpy as np
pd.set_option('display.width', 75) # Sets the width of the display, in characters, before output wraps.
pd.set_option('display.max_columns', 8) # Limits the number of columns displayed in a DataFrame to 8.
pd.set_option('display.max_rows', 20) # Limits the number of rows displayed in a DataFrame to 20.
pd.options.display.float_format = '{:,.2f}'.format # Displays floating-point values with a thousands separator and two decimal places.

**Importing the data**

In [7]:
nls97 = pd.read_csv("R:\\Data cleaning and exploration in python\\Data-Cleaning-and-Exploration-with-Machine-Learning-main\\1. DataCleaningandMachineLearningAlgorithms\\data\\nls97.csv")
nls97.set_index("personid", inplace=True)
print(nls97)

          gender  birthmonth  birthyear  highestgradecompleted  ...  \
personid                                                        ...   
100061    Female           5       1980                  13.00  ...   
100139      Male           9       1983                  12.00  ...   
100284      Male          11       1984                   7.00  ...   
100292      Male           4       1982                    NaN  ...   
100583      Male           1       1980                  13.00  ...   
...          ...         ...        ...                    ...  ...   
999291    Female           4       1981                  16.00  ...   
999406      Male           7       1982                  14.00  ...   
999543    Female           8       1984                  12.00  ...   
999698    Female           5       1983                  12.00  ...   
999963    Female           9       1982                  17.00  ...   

              colenrfeb16      colenroct16      colenrfeb17  \
personid     

**SELECTING A FEW COLUMNS FROM THE DATASET**

In [4]:
democols = ['gender','birthyear','maritalstatus',
 'weeksworked16','wageincome','highestdegree']

In [8]:
nls97demo = nls97[democols]
nls97demo.index.name

'personid'
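The column selection above can be sketched on a small stand-in frame. The `toy` DataFrame and its values are invented for illustration and are not taken from the NLS data. Passing a list of column names returns a DataFrame that keeps the original index, while passing a single name returns a Series.

```python
import pandas as pd

# Hypothetical toy data standing in for nls97; all values are invented.
toy = pd.DataFrame(
    {
        "personid": [1, 2, 3],
        "gender": ["Female", "Male", "Male"],
        "birthyear": [1980, 1983, 1984],
        "wageincome": [12500.0, 120000.0, 58000.0],
    }
).set_index("personid")

cols = ["gender", "wageincome"]
toy_subset = toy[cols]      # list of names -> DataFrame with only those columns
one_col = toy["gender"]     # single name -> Series

# The index (and its name) carries over to the subset.
print(type(toy_subset).__name__, type(one_col).__name__)
print(toy_subset.index.name)
```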

**USING SLICING TO SELECT ROWS BY POSITION**

In [10]:
nls97demo[1000:1004]

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
195884,Male,1981,,,,4. Bachelors
195891,Male,1980,Never-married,53.0,14000.0,2. High School
195970,Female,1982,Never-married,53.0,52000.0,4. Bachelors
195996,Female,1980,,,,3. Associates


**SKIPPING ROWS BY AN INTERVAL**

In [11]:
nls97demo[1000:1004:2]

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
195884,Male,1981,,,,4. Bachelors
195970,Female,1982,Never-married,53.0,52000.0,4. Bachelors
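As a minimal sketch of the two slicing cells above, the following uses a made-up frame (`toy`, with an invented integer index) to show that bracket slicing with integers works by row position, not by index label, and that the stop position is excluded. It appears to behave the same as the explicit `iloc` accessor, which is often the clearer choice.

```python
import pandas as pd

# Hypothetical frame: 10 rows whose index labels (100..109) differ from their positions (0..9).
toy = pd.DataFrame({"value": range(10)}, index=range(100, 110))
toy.index.name = "personid"

# start=2, stop=6 (excluded), step=2 -> positions 2 and 4
by_brackets = toy[2:6:2]
by_iloc = toy.iloc[2:6:2]

print(by_brackets.index.tolist())
print(by_brackets.equals(by_iloc))
```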


**FIRST THREE ROWS**

In [12]:
nls97demo[:3]

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
100061,Female,1980,Married,48.0,12500.0,2. High School
100139,Male,1983,Married,53.0,120000.0,2. High School
100284,Male,1984,Never-married,47.0,58000.0,0. None


In [13]:
nls97demo.head(3)

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
100061,Female,1980,Married,48.0,12500.0,2. High School
100139,Male,1983,Married,53.0,120000.0,2. High School
100284,Male,1984,Never-married,47.0,58000.0,0. None


In [14]:
nls97demo[-3:] # shows the last three rows of the dataset

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
999543,Female,1984,Divorced,0.0,,2. High School
999698,Female,1983,Never-married,0.0,,2. High School
999963,Female,1982,Married,53.0,50000.0,4. Bachelors


In [15]:
nls97demo.tail(3)

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
999543,Female,1984,Divorced,0.0,,2. High School
999698,Female,1983,Never-married,0.0,,2. High School
999963,Female,1982,Married,53.0,50000.0,4. Bachelors


**SELECTING ROWS BY INDEX VALUE USING THE LOC ACCESSOR**

In [16]:
nls97demo.loc[[195884,195891,195970]]

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
195884,Male,1981,,,,4. Bachelors
195891,Male,1980,Never-married,53.0,14000.0,2. High School
195970,Female,1982,Never-married,53.0,52000.0,4. Bachelors


In [17]:
nls97demo.loc[195884:195970]

personid,gender,birthyear,maritalstatus,weeksworked16,wageincome,highestdegree
195884,Male,1981,,,,4. Bachelors
195891,Male,1980,Never-married,53.0,14000.0,2. High School
195970,Female,1982,Never-married,53.0,52000.0,4. Bachelors
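One detail worth noting about the `loc` slice above is that label-based slices include both endpoints, unlike positional slices, which exclude the stop. The sketch below illustrates this on a made-up frame. The `toy` name and the `value` column are assumptions for illustration, and the index labels are borrowed from the output above only to keep the example familiar.

```python
import pandas as pd

# Hypothetical frame with a sorted integer index; the 'value' column is invented.
toy = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=[195884, 195891, 195970, 195996],
)
toy.index.name = "personid"

by_label = toy.loc[195884:195970]   # label slice: both endpoints included
by_position = toy.iloc[0:2]         # positional slice: stop position excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```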
