# Accessing and Cleaning Data in Data Frames

Now that we know how to create and store a data frame using Pandas, we will focus on accessing and changing its data. Data cleaning accounts for a significant portion of a data scientist's or analyst's work: datasets (or portions of them) are often unnecessarily large, redundant, or irrelevant to a given task, so we clean them. This notebook covers some basic data cleaning; more complex methods will be covered in upcoming notebooks.

---

Our first task is to change the row indexing of a data frame. For example, in our previous notebook, we made a data frame for the top 10 happiest countries in the world. We have a column for the rankings from 1 to 10; there's also the leftmost column that indexes each row from 0 to 9. Check it out below:

``` python
import pandas as pd
loc = "../DataSets/Simple/top-ten-happy-countries-forbes.xlsx"
df = pd.read_excel(loc)
df
```

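| | Ranking | Country | Happy Score |
|---|---|---|---|
| 0 | 1 | Finland | 7.632 |
| 1 | 2 | Norway | 7.594 |
| 2 | 3 | Denmark | 7.555 |
| 3 | 4 | Iceland | 7.495 |
| 4 | 5 | Switzerland | 7.487 |
| 5 | 6 | Netherlands | 7.441 |
| 6 | 7 | Canada | 7.382 |
| 7 | 8 | New Zealand | 7.324 |
| 8 | 9 | Sweden | 7.314 |
| 9 | 10 | Australia | 7.272 |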


Let's modify the row index so that it displays the ranking. We will do this by setting the `index` of our data frame to the "Ranking" column. Then, we use Python's `del` statement to remove the now-redundant "Ranking" column.

``` python
df.index = df["Ranking"]
del df["Ranking"]
df
```

| Ranking | Country | Happy Score |
|---|---|---|
| 1 | Finland | 7.632 |
| 2 | Norway | 7.594 |
| 3 | Denmark | 7.555 |
| 4 | Iceland | 7.495 |
| 5 | Switzerland | 7.487 |
| 6 | Netherlands | 7.441 |
| 7 | Canada | 7.382 |
| 8 | New Zealand | 7.324 |
| 9 | Sweden | 7.314 |
| 10 | Australia | 7.272 |
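
The same re-indexing can also be done in one step with `DataFrame.set_index`, which moves a column into the row index and removes it from the columns. The sketch below uses a small hand-built frame (three rows copied from the table above) instead of the Excel file, so the names `small_df` and `ranked_df` are illustrative only.

``` python
import pandas as pd

# Hypothetical three-row stand-in for the happiness table above
small_df = pd.DataFrame({
    "Ranking": [1, 2, 3],
    "Country": ["Finland", "Norway", "Denmark"],
    "Happy Score": [7.632, 7.594, 7.555],
})

# set_index moves "Ranking" into the row index and drops it as a column,
# returning a new frame (small_df itself is left unchanged)
ranked_df = small_df.set_index("Ranking")

# Rows can now be looked up by ranking with .loc
print(ranked_df.loc[2, "Country"])  # Norway
```

Because `set_index` returns a new data frame by default, the original stays available if the re-indexing turns out to be a mistake.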


---

Now, let's try **filtering** the dataset so that it keeps only rows with specific values. We can do this using the `drop()` method, which removes rows by their index labels. For this, we will use the UFO Sightings dataset from Kaggle (available [here](https://www.kaggle.com/NUFORC/ufo-sightings/version/1)).

In this dataset, there is a column that shows how long each UFO sighting lasted (in seconds). Let's filter out all rows whose duration is less than 1000 seconds.

``` python
ufo_locate = "../Datasets/Kaggle/scrubbed.csv"
ufo_df = pd.read_csv(ufo_locate)
ufo_df.head()
```

| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |


``` python
# drop() expects row labels, so pass the index of the rows to remove
too_short = ufo_df[ufo_df["duration (seconds)"] < 1000].index
modified_ufo_df = ufo_df.drop(too_short)
modified_ufo_df.head()
```
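
As a sketch of the filtering step, the example below builds a small hypothetical frame (`sample_df`, reusing a few values from the preview above) and removes rows shorter than 1000 seconds in two ways: by passing row labels to `drop()`, and by keeping rows with a boolean mask. It first converts the column with `pd.to_numeric`, on the assumption that the real CSV may hold non-numeric entries in that column; this notebook does not confirm that assumption.

``` python
import pandas as pd

# Hypothetical sample; durations are stored as strings to mimic a mixed-type column
sample_df = pd.DataFrame({
    "city": ["san marcos", "lackland afb", "chester (uk/england)", "edna", "kaneohe"],
    "duration (seconds)": ["2700", "7200", "20", "20", "900"],
})

# Convert to numbers; values that cannot be parsed become NaN
sample_df["duration (seconds)"] = pd.to_numeric(
    sample_df["duration (seconds)"], errors="coerce"
)

# Boolean mask marking the rows to remove
too_short = sample_df["duration (seconds)"] < 1000

# Option 1: drop() takes row labels, so pass the index of the masked rows
dropped_df = sample_df.drop(sample_df[too_short].index)

# Option 2: keep only the rows where the mask is False
kept_df = sample_df[~too_short]

print(dropped_df.equals(kept_df))  # True
```

Both options give the same result here. Any value that cannot be parsed becomes `NaN`, and since `NaN < 1000` evaluates to `False`, such rows would be kept by either approach unless they are removed separately.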

WORK IN PROGRESS!!!