# Accessing and Cleaning Data in Data Frames

Now that we know how to create and store a data frame using Pandas, we will focus on accessing and changing its data. Data cleaning accounts for a significant portion of a data scientist's or analyst's work; datasets (or portions of them) are often unnecessarily large, redundant, or useless for a given task, so we clean them. This notebook covers some basic data cleaning; more complex methods will be covered in upcoming notebooks.

---

Our first task is to change the row indexing of a data frame. For example, in our previous notebook, we made a data frame for the top 10 happiest countries in the world. We have a column for the rankings from 1 to 10; there's also the leftmost column that indexes each row from 0 to 9. Check it out below:

In [1]:
import pandas as pd
loc = "../DataSets/Simple/top-ten-happy-countries-forbes.xlsx"
df = pd.read_excel(loc)
df

Unnamed: 0,Ranking,Country,Happy Score
0,1,Finland,7.632
1,2,Norway,7.594
2,3,Denmark,7.555
3,4,Iceland,7.495
4,5,Switzerland,7.487
5,6,Netherlands,7.441
6,7,Canada,7.382
7,8,New Zealand,7.324
8,9,Sweden,7.314
9,10,Australia,7.272


Let's modify the row indexes so that they display the ranking. We will do this by setting the `index` of our data frame to the "Ranking" column. Then, we use Python's `del` statement to remove the now-redundant "Ranking" column.

In [4]:
df.index = df["Ranking"]
del df["Ranking"]
df

Ranking,Country,Happy Score
1,Finland,7.632
2,Norway,7.594
3,Denmark,7.555
4,Iceland,7.495
5,Switzerland,7.487
6,Netherlands,7.441
7,Canada,7.382
8,New Zealand,7.324
9,Sweden,7.314
10,Australia,7.272
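
The same re-indexing can also be done in one step with `set_index()`. Here is a minimal sketch on a small hand-built frame; the three rows are copied from the table above, while the names `toy_df` and `ranked_df` are made up for illustration.

```python
import pandas as pd

# A small stand-in for the happiness data (first three rows of the table above).
toy_df = pd.DataFrame({
    "Ranking": [1, 2, 3],
    "Country": ["Finland", "Norway", "Denmark"],
    "Happy Score": [7.632, 7.594, 7.555],
})

# set_index() moves "Ranking" out of the columns and into the row index.
# It returns a new data frame and leaves toy_df untouched.
ranked_df = toy_df.set_index("Ranking")
print(ranked_df)

# Rows can now be looked up by ranking with .loc.
print(ranked_df.loc[2, "Country"])  # Norway
```

Because `set_index()` returns a new data frame, the original stays available if you need the "Ranking" column again later.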


---

Now, let's get rid of the missing (`NULL`) values in our dataset. Pandas displays them as `NaN`; they stand for entries that were never recorded, and that can be a problem when you are trying to analyze your data. The simplest option is to get rid of them altogether, so that is what we will do first.

Let's work on the Museums dataset offered by Kaggle (access it [here](https://www.kaggle.com/imls/museum-directory)). This is an exhaustive list of pretty much all of the museums, aquariums, and zoos in the United States! Let's check it out.

In [2]:
mus_locate = "../Datasets/Kaggle/museums.csv"
mus_df = pd.read_csv(mus_locate)
mus_df.head()


Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue
0,8400200098,ALASKA AVIATION HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,4721 AIRCRAFT DR,ANCHORAGE,AK,99502,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
1,8400200117,ALASKA BOTANICAL GARDEN,ALASKA BOTANICAL GARDEN INC,,"ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",,4601 CAMPBELL AIRSTRIP RD,ANCHORAGE,AK,99507,...,61.1689,-149.76708,4.0,20.0,2.0,6,920115504,201312.0,1379576.0,1323742.0
2,8400200153,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,,SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM,,9711 KENAI SPUR HWY,KENAI,AK,99611,...,60.56149,-151.21598,3.0,122.0,2.0,6,921761906,201312.0,740030.0,729080.0
3,8400200143,ALASKA EDUCATORS HISTORICAL SOCIETY,ALASKA EDUCATORS HISTORICAL SOCIETY,,HISTORIC PRESERVATION,,214 BIRCH STREET,KENAI,AK,99611,...,60.5628,-151.26597,3.0,122.0,2.0,6,920165178,201412.0,0.0,0.0
4,8400200027,ALASKA HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,301 W NORTHERN LIGHTS BLVD,ANCHORAGE,AK,99503,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0


As you may have noticed, there are `NaN` values in the "Alternate Name" column, as well as the "Institution Name" column. We can get rid of them using the `dropna()` method. This is a Pandas method that gets rid of any row that has a `NaN` value in it!

In [6]:
no_null_mus_df = mus_df.dropna()
no_null_mus_df

Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue


Wow, that just got rid of everything... It turns out that every row in this dataset has at least one `NaN` value. Sometimes datasets can be annoyingly incomplete, and it's up to the data analyst (*you*) to figure out how to handle them. Okay, let's be a bit more specific. Can we get rid of only the rows that have `NaN` in the "Institution Name" column? This can be done by passing the `subset` parameter to our `dropna()` method.

In [9]:
mus_df_modified = mus_df.dropna(subset=['Institution Name'])
mus_df_modified.head()

Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue
29,8409502704,ARC GALLERY,UNIVERSITY OF ALASKA,,ART MUSEUM,UNIVERSITY OF ALASKA ANCHORAGE,3211 PROVIDENCE DRIVE,ANCHORAGE,AK,99508,...,64.81712,-147.87837,2.0,90.0,2.0,6,926000147,201306.0,,
35,8409502921,CARR-GOTTSTEIN GALLERY,ALASKA PACIFIC UNIVERSITY,,ART MUSEUM,ALASKA PACIFIC UNIVERSITY,4101 UNIVERSITY DRIVE,ANCHORAGE,AK,99508,...,61.18889,-149.80347,1.0,20.0,2.0,6,920023588,201306.0,66512448.0,63692296.0
57,8409503302,GARY L. FREEBURG ART GALLERY,UNIVERSITY OF ALASKA,,ART MUSEUM,KENAI PENINSULA COLLEGE,3211 PROVIDENCE DRIVE,ANCHORAGE,AK,99508,...,64.81712,-147.87837,2.0,90.0,2.0,6,926000147,201306.0,,
60,8409502161,GEORGESON BOTANICAL GARDEN,UNIVERSITY OF ALASKA,,"ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",UNIVERSITY OF ALASKA FAIRBANKS,117 WEST TANANA DRIVE,FAIRBANKS,AK,99709,...,64.81712,-147.87837,2.0,90.0,2.0,6,926000147,201306.0,,
88,8409503530,KIMURA GALLERY,UNIVERSITY OF ALASKA,,ART MUSEUM,UNIVERSITY OF ALASKA ANCHORAGE,3211 PROVIDENCE DRIVE,ANCHORAGE,AK,99508,...,64.81712,-147.87837,2.0,90.0,2.0,6,926000147,201306.0,,


*Ooooh*, we got rid of all the rows with `NaN` in the "Institution Name" column. Now we know which museums are affiliated with a university or institution!
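
To make the difference between a plain `dropna()` and `dropna(subset=...)` easier to see, here is a small sketch on an invented three-row frame. The museum names, the institution name, and the variable names are all made up; only the column names come from the museums dataset. It also uses `isna().sum()`, which is one way to count the missing values per column before deciding what to drop.

```python
import numpy as np
import pandas as pd

# Invented rows: every row is missing "Alternate Name", one has an institution.
toy_mus = pd.DataFrame({
    "Museum Name": ["A", "B", "C"],
    "Alternate Name": [np.nan, np.nan, np.nan],
    "Institution Name": [np.nan, "SOME UNIVERSITY", np.nan],
})

# Count the missing values in each column.
missing_per_column = toy_mus.isna().sum()
print(missing_per_column)

# Every row has at least one NaN, so a plain dropna() leaves nothing.
all_dropped = toy_mus.dropna()
print(len(all_dropped))

# Restricting the check to one column keeps the rows that are complete there.
with_institution = toy_mus.dropna(subset=["Institution Name"])
print(with_institution)
```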

---

Another technique that data analysts use to handle `NaN` values is to replace them with dummy values, or with the average of the non-null values. Why get rid of a row just because it has one null value? Let's replace the null values in the "Revenue" column with the column's mean.

In [7]:
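# Assign the result back; inplace=True on a selected column triggers warnings in recent pandas.
mus_df["Revenue"] = mus_df["Revenue"].fillna(mus_df["Revenue"].mean())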
mus_df.head(10)

Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue
0,8400200098,ALASKA AVIATION HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,4721 AIRCRAFT DR,ANCHORAGE,AK,99502,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
1,8400200117,ALASKA BOTANICAL GARDEN,ALASKA BOTANICAL GARDEN INC,,"ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",,4601 CAMPBELL AIRSTRIP RD,ANCHORAGE,AK,99507,...,61.1689,-149.76708,4.0,20.0,2.0,6,920115504,201312.0,1379576.0,1323742.0
2,8400200153,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,,SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM,,9711 KENAI SPUR HWY,KENAI,AK,99611,...,60.56149,-151.21598,3.0,122.0,2.0,6,921761906,201312.0,740030.0,729080.0
3,8400200143,ALASKA EDUCATORS HISTORICAL SOCIETY,ALASKA EDUCATORS HISTORICAL SOCIETY,,HISTORIC PRESERVATION,,214 BIRCH STREET,KENAI,AK,99611,...,60.5628,-151.26597,3.0,122.0,2.0,6,920165178,201412.0,0.0,0.0
4,8400200027,ALASKA HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,301 W NORTHERN LIGHTS BLVD,ANCHORAGE,AK,99503,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
5,8400200096,ALASKA HISTORICAL MUSEUM,ALASKA HISTORICAL MUSEUM INC,,HISTORIC PRESERVATION,,1675 E 5TH AVE,ANCHORAGE,AK,99501,...,61.21785,-149.85049,1.0,20.0,2.0,6,920062352,,,20976050.0
6,8400200078,ALASKA JEWISH MUSEUM,ALASKA JEWISH HISTORICAL MUSEUM AND CULTURAL C...,,GENERAL MUSEUM,,1117 E 35TH AVE,ANCHORAGE,AK,99508,...,61.18946,-149.86071,1.0,20.0,2.0,6,711010049,201312.0,2658938.0,34374.0
7,8400200084,ALASKA LIGHTHOUSE ASSOCIATION,ALASKA LIGHTHOUSE ASSOCIATION,,HISTORIC PRESERVATION,,2116 B 2ND ST,DOUGLAS,AK,99824,...,58.28299,-134.40583,3.0,110.0,2.0,6,911833974,201312.0,16500.0,16500.0
8,8400200107,ALASKA MASONIC LIBRARY AND MUSEUM FOUNDATION,ALASKA MASONIC LIBRARY AND MUSEUM FOUNDATION,,GENERAL MUSEUM,,PO BOX 190668,ANCHORAGE,AK,99519,...,61.21833,-149.89456,1.0,20.0,2.0,6,920095561,201406.0,0.0,0.0
9,8400200073,ALASKA MINING HALL OF FAME FOUNDATION,ALASKA MINING HALL OF FAME FOUNDATION,,HISTORY MUSEUM,,PO BOX 81906,FAIRBANKS,AK,99708,...,64.85079,-147.82945,2.0,90.0,2.0,6,550819611,201412.0,184295.0,31393.0


The "Revenue" value in row 5 used to be `NaN`, but now it is equal to the average of all the other values in the column. Changing it from `NaN` to the mean leaves the column's overall mean unchanged, so it will not skew (or screw up) that average. Keep in mind, though, that it does shrink the column's spread (its variance), since every filled value sits exactly at the center.
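
To see what filling with the mean does to a column, here is a minimal sketch on a made-up series. The numbers and the names `revenue` and `filled` are invented for illustration and are not taken from the museums data. It compares the mean and the standard deviation before and after filling.

```python
import numpy as np
import pandas as pd

# Five invented revenue values, two of them missing.
revenue = pd.Series([100.0, 200.0, np.nan, 300.0, np.nan], name="Revenue")

# mean() skips NaN by default, so this is the mean of 100, 200 and 300.
mean_revenue = revenue.mean()

# fillna() returns a new Series; the original is left unchanged.
filled = revenue.fillna(mean_revenue)

# The mean is the same before and after filling...
print(mean_revenue, filled.mean())

# ...but the standard deviation is smaller after filling.
print(revenue.std(), filled.std())
```

In this toy case the mean stays at 200 while the spread drops, which is worth remembering if a later analysis depends on how much the values vary.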

---

Our next task will be to filter data given a certain condition. Let's say we only want the museums/parks with substantial income. Let's define substantial income to be greater than 100000 dollars. Using a boolean condition inside square brackets...

In [8]:
mus_df_good_income = mus_df[mus_df["Income"] > 100000]
mus_df_good_income.head()

Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue
0,8400200098,ALASKA AVIATION HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,4721 AIRCRAFT DR,ANCHORAGE,AK,99502,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
1,8400200117,ALASKA BOTANICAL GARDEN,ALASKA BOTANICAL GARDEN INC,,"ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",,4601 CAMPBELL AIRSTRIP RD,ANCHORAGE,AK,99507,...,61.1689,-149.76708,4.0,20.0,2.0,6,920115504,201312.0,1379576.0,1323742.0
2,8400200153,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,,SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM,,9711 KENAI SPUR HWY,KENAI,AK,99611,...,60.56149,-151.21598,3.0,122.0,2.0,6,921761906,201312.0,740030.0,729080.0
4,8400200027,ALASKA HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,301 W NORTHERN LIGHTS BLVD,ANCHORAGE,AK,99503,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
6,8400200078,ALASKA JEWISH MUSEUM,ALASKA JEWISH HISTORICAL MUSEUM AND CULTURAL C...,,GENERAL MUSEUM,,1117 E 35TH AVE,ANCHORAGE,AK,99508,...,61.18946,-149.86071,1.0,20.0,2.0,6,711010049,201312.0,2658938.0,34374.0


Notice that all the remaining rows have incomes greater than 100000 dollars; rows 3 and 5, with an income of 0 and `NaN`, are gone. Sweet!
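
The filter above works in two steps: the comparison builds a column of `True`/`False` values, and the square brackets keep only the `True` rows. Here is a small sketch on an invented frame (the museum names and the variable names `toy_income`, `mask`, and `high_income` are hypothetical) that shows the intermediate mask, including what happens to a `NaN` income.

```python
import numpy as np
import pandas as pd

# Invented incomes: one large, one zero, one missing, one large.
toy_income = pd.DataFrame({
    "Museum Name": ["A", "B", "C", "D"],
    "Income": [602912.0, 0.0, np.nan, 2658938.0],
})

# Comparing a column to a number yields a boolean Series, one entry per row.
# A comparison against NaN is False, so rows with missing income are excluded.
mask = toy_income["Income"] > 100000
print(mask)

# Indexing with the mask keeps only the rows where it is True.
high_income = toy_income[mask]
print(high_income)
```

Note that the surviving rows keep their original index labels (0 and 3 here), just like the gaps in the output above.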

---

Our last task for this notebook is to create a function that does the filtering and searching for us. How about this: let's create a column of 1s and 0s that indicates whether a museum generates more than 100000 dollars of income. First, we define a function. Then, we use the `apply()` method that comes with Pandas to *apply* that function to our data frame!
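
As a rough sketch of that idea on a toy frame: the function name `is_high_income`, its `threshold` parameter, the new column name "High Income", and the sample rows are all assumptions made for illustration, not part of the museums dataset.

```python
import numpy as np
import pandas as pd

def is_high_income(income, threshold=100000):
    """Return 1 if income exceeds threshold, else 0 (a NaN income gives 0)."""
    return 1 if income > threshold else 0

# Invented incomes: one large, one zero, one missing, one large.
toy_income = pd.DataFrame({
    "Museum Name": ["A", "B", "C", "D"],
    "Income": [602912.0, 0.0, np.nan, 2658938.0],
})

# apply() calls the function once per value in the "Income" column
# and collects the results into a new Series.
toy_income["High Income"] = toy_income["Income"].apply(is_high_income)
print(toy_income)

# For a simple threshold, a vectorized comparison gives the same column.
vectorized = (toy_income["Income"] > 100000).astype(int)
print(vectorized.tolist())
```

`apply()` is the more flexible of the two, since the function can contain any logic you like; the vectorized version is usually faster on large data frames when the rule is a simple comparison.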