# Accessing and Cleaning Data in Data Frames

Now that we know how to create and store a data frame using Pandas, we will focus on accessing and changing its data. Data cleaning accounts for a significant portion of a data scientist's or analyst's work: datasets (or portions of them) are often unnecessarily large, redundant, or irrelevant to a given task, so we clean them. This notebook covers some basic data cleaning; more complex methods will be covered in upcoming notebooks.

---

Our first task is to change the row indexing of a data frame. For example, in our previous notebook, we made a data frame for the top 10 happiest countries in the world. We have a column for the rankings from 1 to 10; there's also the leftmost column that indexes each row from 0 to 9. Check it out below:

In [1]:
import pandas as pd
loc = "../DataSets/Simple/top-ten-happy-countries-forbes.xlsx"
df = pd.read_excel(loc)
df

Unnamed: 0,Ranking,Country,Happy Score
0,1,Finland,7.632
1,2,Norway,7.594
2,3,Denmark,7.555
3,4,Iceland,7.495
4,5,Switzerland,7.487
5,6,Netherlands,7.441
6,7,Canada,7.382
7,8,New Zealand,7.324
8,9,Sweden,7.314
9,10,Australia,7.272


Let's modify the row indexes so that they display the ranking. We will do this by setting the `index` of our data frame to the "Ranking" column. Then, we use Python's `del` statement to remove the now-redundant "Ranking" column.

In [4]:
df.index = df["Ranking"]
del df["Ranking"]
df

Ranking,Country,Happy Score
1,Finland,7.632
2,Norway,7.594
3,Denmark,7.555
4,Iceland,7.495
5,Switzerland,7.487
6,Netherlands,7.441
7,Canada,7.382
8,New Zealand,7.324
9,Sweden,7.314
10,Australia,7.272
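
As an aside, the same re-indexing can also be done in one step with the `set_index()` method. The sketch below is a minimal illustration, not part of the original notebook: it rebuilds a three-row excerpt of the table by hand (the names `top3` and `ranked` are made up for this example) so that it runs without the Excel file.

```python
import pandas as pd

# A three-row excerpt of the table above, rebuilt by hand so the example
# runs on its own (top3 and ranked are illustrative names).
top3 = pd.DataFrame({
    "Ranking": [1, 2, 3],
    "Country": ["Finland", "Norway", "Denmark"],
    "Happy Score": [7.632, 7.594, 7.555],
})

# set_index moves the column into the index and removes it from the columns
# in a single step; it returns a new DataFrame and leaves top3 untouched.
ranked = top3.set_index("Ranking")
print(ranked)

# Rows can now be looked up by ranking with .loc.
print(ranked.loc[2, "Country"])  # Norway
```

Because `set_index()` returns a new data frame by default, the original stays available if you need the 0-based index again.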


---

Now, let's deal with the missing values in our dataset. Pandas displays them as `NaN` ("not a number"); other tools often call them `NULL`. They mark cells where no value was recorded, and that can be a problem when you are trying to analyze your data. The simplest option is often to get rid of them altogether! That is what we will do first.

Let's work on the Museums dataset hosted on Kaggle (access it [here](https://www.kaggle.com/imls/museum-directory)). This is an extensive list of museums, aquariums, and zoos in the United States! Let's check it out.

In [4]:
mus_locate = "../Datasets/Kaggle/museums.csv"
mus_df = pd.read_csv(mus_locate)
mus_df.head()


Unnamed: 0,Museum ID,Museum Name,Legal Name,Alternate Name,Museum Type,Institution Name,Street Address (Administrative Location),City (Administrative Location),State (Administrative Location),Zip Code (Administrative Location),...,Latitude,Longitude,Locale Code (NCES),County Code (FIPS),State Code (FIPS),Region Code (AAM),Employer ID Number,Tax Period,Income,Revenue
0,8400200098,ALASKA AVIATION HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,4721 AIRCRAFT DR,ANCHORAGE,AK,99502,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
1,8400200117,ALASKA BOTANICAL GARDEN,ALASKA BOTANICAL GARDEN INC,,"ARBORETUM, BOTANICAL GARDEN, OR NATURE CENTER",,4601 CAMPBELL AIRSTRIP RD,ANCHORAGE,AK,99507,...,61.1689,-149.76708,4.0,20.0,2.0,6,920115504,201312.0,1379576.0,1323742.0
2,8400200153,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,ALASKA CHALLENGER CENTER FOR SPACE SCIENCE TEC...,,SCIENCE & TECHNOLOGY MUSEUM OR PLANETARIUM,,9711 KENAI SPUR HWY,KENAI,AK,99611,...,60.56149,-151.21598,3.0,122.0,2.0,6,921761906,201312.0,740030.0,729080.0
3,8400200143,ALASKA EDUCATORS HISTORICAL SOCIETY,ALASKA EDUCATORS HISTORICAL SOCIETY,,HISTORIC PRESERVATION,,214 BIRCH STREET,KENAI,AK,99611,...,60.5628,-151.26597,3.0,122.0,2.0,6,920165178,201412.0,0.0,0.0
4,8400200027,ALASKA HERITAGE MUSEUM,ALASKA AVIATION HERITAGE MUSEUM,,HISTORY MUSEUM,,301 W NORTHERN LIGHTS BLVD,ANCHORAGE,AK,99503,...,61.17925,-149.97254,1.0,20.0,2.0,6,920071852,201312.0,602912.0,550236.0
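
Before removing anything, it can help to count how many values are actually missing. The sketch below assumes a tiny hand-made stand-in for the museums table (the frame `sample` and all of its values are hypothetical) and uses `isna()` to count missing cells per column and per row.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the museums table; values are illustrative only.
sample = pd.DataFrame({
    "Museum Name": ["Museum A", "Museum B", "Museum C"],
    "Alternate Name": [np.nan, np.nan, "C Annex"],
    "Institution Name": [np.nan, "B University", np.nan],
})

# isna() marks each missing cell with True; sum() then counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# any(axis=1) flags rows that have at least one missing cell.
rows_with_missing = sample.isna().any(axis=1)
print(rows_with_missing.sum(), "of", len(sample), "rows have a missing value")
```

The same two calls should work unchanged on `mus_df`, which gives a quick picture of which columns are sparse.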


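As you may have noticed, there are `NaN` values in the "Alternate Name" column, as well as the "Institution Name" column. We can get rid of them using the `dropna()` method. By default, this Pandas method returns a new data frame without any row that has at least one `NaN` value in it! Be careful, though: all five rows shown above have a `NaN` in both of these columns, so calling `dropna()` with no arguments would remove every one of them.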
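
Here is a minimal sketch of how `dropna()` behaves, using a small made-up frame rather than the real museums file (the name `toy_df` and its values are assumptions for illustration). It shows the default behavior and the `subset` parameter, which limits the check to the columns you care about.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the museums data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Museum Name": ["Museum A", "Museum B", "Museum C", "Museum D"],
    "Alternate Name": [np.nan, "B Gallery", np.nan, "D House"],
    "Income": [602912.0, np.nan, 740030.0, 0.0],
})

# Default behavior: drop every row that contains at least one NaN.
# Only "Museum D" has no missing values, so only that row survives.
no_missing = toy_df.dropna()
print(no_missing)

# subset restricts the check to the listed columns, so rows that are only
# missing an "Alternate Name" are kept.
has_income = toy_df.dropna(subset=["Income"])
print(has_income)

# dropna() returns a new DataFrame; toy_df itself still has all four rows.
print(len(toy_df))
```

Restricting the check with `subset` is often the safer choice when a column such as "Alternate Name" is empty for most rows.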

WORK IN PROGRESS!!!