# Inspecting Data with Pandas

### 1. Import needed package(s)
- At the start of any notebook or script, put all of the import statements.
- Here we import the library and also give it the alias `pd`.
- The alias saves typing: we write `pd` instead of `pandas` every time we use the library.

In [76]:
import pandas as pd

### 2. Reading Data

Pandas has a function, `read_csv`, that loads `.csv` and other text-based files as a DataFrame. This week's project uses `.txt` files.

In [5]:
df_1880 = pd.read_csv('../data/yob1880.txt')
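`read_csv` reads any delimited text, not only files whose name ends in `.csv`. The following is a minimal sketch using small in-memory samples instead of the project files; the tab-separated variant and the variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample text standing in for files on disk (with a header line).
comma_text = "name,gender,frequency\nMary,F,7065\nAnna,F,2604\n"
tab_text = "name\tgender\tfrequency\nMary\tF\t7065\nAnna\tF\t2604\n"

# The default delimiter is a comma.
df_comma = pd.read_csv(io.StringIO(comma_text))

# sep sets a different delimiter, here a tab.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both calls produce the same DataFrame.
print(df_comma.equals(df_tab))
```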


### 3. Attributes and Methods

Let's use the `.head()` *method* to take a look at the first lines of the DataFrame. A method is called on a DataFrame object with a dot, followed by the method name and parentheses. By default, `.head()` returns the first 5 rows.

In [77]:
df_1880.head()

Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746


Something is a bit off with our dataframe. What could it be?
- Although the intention of the `.head()` method is to show the first 5 rows, 6 are shown here!
- This is because the first row was expected to be the `header` of the dataframe, when in fact it is a row of data (also called an observation).
- Most datasets come with the header included. If not, a header can be passed when reading in the data.

When reading in the data, add the `names` parameter with a list of the column names:

In [78]:
df_1880 = pd.read_csv('../data/yob1880.txt', names=['name', 'gender', 'frequency'])

Now using the `.head()` method we can see the header and the data in the first 5 rows of the dataframe.

In [79]:
df_1880.head()

Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746
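The effect of `names` can be reproduced on a small scale. This sketch reads three rows, copied from the output above, from an in-memory string rather than from the project file; the variable names are assumptions.

```python
import io
import pandas as pd

# Three rows copied from the output above; the text has no header line.
raw = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n"

# Without names, the first data row is consumed as the header.
without_names = pd.read_csv(io.StringIO(raw))
print(list(without_names.columns))  # ['Mary', 'F', '7065']
print(len(without_names))           # 2

# With names, all three rows are kept as data.
with_names = pd.read_csv(io.StringIO(raw), names=["name", "gender", "frequency"])
print(len(with_names))              # 3
```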


`.shape` is an *attribute*. It can be accessed on any DataFrame with a dot. It holds the number of rows and columns as a Python *tuple*:

In [80]:
df_1880.shape

(2000, 3)

Notice that methods have `()` and attributes do not. Attributes tell us something about the object. In this case the object is a dataframe with population statistics, and the `.shape` attribute tells us its number of rows and columns. Methods do something with the object and can also be given parameters. As you can see, the default for the `.head()` method is to show 5 rows. We can pass a different number, such as 3, and it will show only 3 rows.

In [81]:
df_1880.head(3)

Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
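The difference between attributes and methods can also be checked in code. This is a small sketch on a DataFrame built by hand from three rows of the output above, not on the project file.

```python
import pandas as pd

# Small DataFrame built by hand for illustration.
df = pd.DataFrame({"name": ["Mary", "Anna", "Emma"], "frequency": [7065, 2604, 2003]})

# An attribute is read without parentheses; .shape is a plain tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)        # 3 2

# A method is called with parentheses and can take arguments.
print(len(df.head(2)))       # 2

# Without parentheses the method is only referred to, not called.
print(callable(df.head))     # True
print(callable(df.shape))    # False
```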


### 4. Access to index and column names

In [82]:
df_1880.index

RangeIndex(start=0, stop=2000, step=1)

This tells us that the index is numeric, starting at 0 and stopping before 2000 (so the last label is 1999), with a step of 1.
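As with Python's `range`, the `stop` value of a `RangeIndex` is not part of the index. A short sketch with a smaller, made-up index shows this:

```python
import pandas as pd

# A small RangeIndex for illustration.
idx = pd.RangeIndex(start=0, stop=5, step=1)

print(idx.tolist())   # [0, 1, 2, 3, 4]
print(len(idx))       # 5
print(idx[-1])        # 4
```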

In [83]:
df_1880.columns

Index(['name', 'gender', 'frequency'], dtype='object')

`.columns` returns the column names as a *list-like* `Index` object. We can access each element as we would in a Python list:

In [14]:
df_1880.columns[0]


'name'
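The object stored in `.columns` behaves much like a list. This sketch uses an empty DataFrame with the same three column names to show a few common operations:

```python
import pandas as pd

# Empty DataFrame that only carries the column names.
df = pd.DataFrame(columns=["name", "gender", "frequency"])

cols = df.columns
print(type(cols).__name__)   # Index
print(cols[0])               # name
print(cols.tolist())         # ['name', 'gender', 'frequency']
print("gender" in cols)      # True
```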

### 5. Information about each column

- `.info()` gives us the non-null count and the data type of each column.
- Null means that a cell contains no data; pandas also refers to this as NaN. Such cells are considered **Missing Data**.

In [15]:
df_1880.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       2000 non-null   object
 1   gender     2000 non-null   object
 2   frequency  2000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 47.0+ KB
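The 1880 data has no missing values, so here is a sketch with a hand-made DataFrame that does contain some. The rows are made up for illustration; `isna()` and `count()` are the standard pandas ways to look at missing data directly.

```python
import pandas as pd

# Made-up rows; None marks a missing value.
df = pd.DataFrame({
    "name": ["Mary", "Anna", None],
    "frequency": [7065, None, 2003],
})

# isna() flags missing cells; sum() counts them per column.
missing = df.isna().sum()
print(missing["name"], missing["frequency"])   # 1 1

# count() gives the non-null count that .info() reports.
print(df["name"].count())                      # 2

# A missing number turns the column into float64 instead of int64.
print(df["frequency"].dtype)                   # float64
```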


### 6. Pandas `.to_csv()` parameters

- Since changes have been made to the data, it is a good idea to save them in case we want to use the data in this format later.
- This can be done with the `.to_csv()` method.
- `.to_csv()` writes the data to a file name and path chosen by the user.

In [16]:
df_1880.to_csv('../df_1880.csv')


Check to see what the data looks like when it is read back into the notebook.

In [18]:
df_1880_new = pd.read_csv('../df_1880.csv')
df_1880_new.head()


Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746


What happened? Why is there now an additional 'unnamed' column?

- When the data was written, the index was also written to the `.csv` file.
- This can be avoided by passing `index=False` to `.to_csv()`.

In [84]:
df_1880.to_csv('../df_1880.csv', index=False)

In [85]:
df_1880 = pd.read_csv('../df_1880.csv')
df_1880.head()


Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746
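The round trip can be tried without touching any files. In this sketch `to_csv()` is called without a path, so it returns the CSV text as a string; the two-row DataFrame and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Mary", "Anna"], "frequency": [7065, 2604]})

# Default: the index is written as an extra, unnamed first column.
with_index = df.to_csv()
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
# ['Unnamed: 0', 'name', 'frequency']

# index=False leaves the index out of the output.
without_index = df.to_csv(index=False)
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
# ['name', 'frequency']

# Alternative when the index is already in the file: read it back as the index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored.columns.tolist())
# ['name', 'frequency']
```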


### 7. Try again on your own with the year 2000

We currently have a dataframe for the year 1880 ready to use. Let's also read in one for the year 2000. Do you remember how to read in a dataframe with pandas?

**Note:** replace the ??? with the correct pandas commands and make sure the dataframe has a relevant header.

In [19]:
df_2000 = pd.read_csv('../data/yob2000.txt')

To ensure the data has been read in correctly, it is always a good idea to look at the first few rows of the dataframe. How did we do that?

In [20]:
df_2000.head()

Unnamed: 0,Emily,F,25957
0,Hannah,F,23084
1,Madison,F,19968
2,Ashley,F,17997
3,Sarah,F,17706
4,Alexis,F,17630


Perform the same steps as were performed on the 1880 dataset to give the dataset the correct header (column names) and save it as a `.csv` file.

In [21]:
df_2000 = pd.read_csv('../data/yob2000.txt', names=['name', 'gender', 'frequency'])


In [42]:
df_2000.index




RangeIndex(start=0, stop=29777, step=1)

In [22]:
df_2000.columns

Index(['name', 'gender', 'frequency'], dtype='object')

In [46]:
df_2000.columns[0]


'name'

In [23]:
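# Note: the parentheses are missing here, so the method is not called and the
# notebook displays the bound method instead. Use df_2000.info() for the summary.
df_2000.info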

<bound method DataFrame.info of           name gender  frequency
0        Emily      F      25957
1       Hannah      F      23084
2      Madison      F      19968
3       Ashley      F      17997
4        Sarah      F      17706
...        ...    ...        ...
29772     Zeph      M          5
29773    Zeven      M          5
29774    Ziggy      M          5
29775       Zo      M          5
29776    Zyier      M          5

[29777 rows x 3 columns]>

In [24]:
df_2000.to_csv('./df_2000.csv', index = False)

In [25]:
df_2000_new = pd.read_csv('./df_2000.csv')
df_2000_new.head()

Unnamed: 0,name,gender,frequency
0,Emily,F,25957
1,Hannah,F,23084
2,Madison,F,19968
3,Ashley,F,17997
4,Sarah,F,17706


## Short review

Using the `df_2000` dataframe, try to find the correct command for each of the following results.

In [27]:
# 1. display the DataFrame
df_2000

Unnamed: 0,name,gender,frequency
0,Emily,F,25957
1,Hannah,F,23084
2,Madison,F,19968
3,Ashley,F,17997
4,Sarah,F,17706
...,...,...,...
29772,Zeph,M,5
29773,Zeven,M,5
29774,Ziggy,M,5
29775,Zo,M,5


In [28]:
# 2. display the first 5 rows
df_2000.head()

Unnamed: 0,name,gender,frequency
0,Emily,F,25957
1,Hannah,F,23084
2,Madison,F,19968
3,Ashley,F,17997
4,Sarah,F,17706


In [34]:
# 3. display the last 5 rows
df_2000.tail()


Unnamed: 0,name,gender,frequency
29772,Zeph,M,5
29773,Zeven,M,5
29774,Ziggy,M,5
29775,Zo,M,5
29776,Zyier,M,5


In [35]:
# 4. number of rows and columns
df_2000.shape

(29777, 3)

In [33]:
# 5. list the column names
df_2000.columns

Index(['name', 'gender', 'frequency'], dtype='object')

In [37]:
# 6. list the row index
df_2000.index

RangeIndex(start=0, stop=29777, step=1)

In [38]:
# 7. display the column types
df_2000.dtypes

name         object
gender       object
frequency     int64
dtype: object

In [50]:
pwd


'C:\\Users\\MarinaBishay\\Desktop\\spiced\\lemongrass-regression-student-code\\week_02_pandas\\weekly_project\\notebook'

In [54]:
def parse_dataset(year):
    """Read the names file for the given year into a DataFrame."""
    # Use the year passed by the caller to build the file path.
    df = pd.read_csv(f'../data/yob{year}.txt', names=["name", "gender", "frequency"])
    return df


parse_dataset(1880)


Unnamed: 0,name,gender,frequency
0,Mary,F,7065
1,Anna,F,2604
2,Emma,F,2003
3,Elizabeth,F,1939
4,Minnie,F,1746
...,...,...,...
1995,Woodie,M,5
1996,Worthy,M,5
1997,Wright,M,5
1998,York,M,5
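A variant of `parse_dataset` that also takes the data folder as a parameter is easy to try on a throwaway file. This is only a sketch: the `data_dir` parameter and the temporary sample file are assumptions and are not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def parse_dataset(year, data_dir="../data"):
    """Read yob<year>.txt from data_dir and label the three columns."""
    path = Path(data_dir) / f"yob{year}.txt"
    return pd.read_csv(path, names=["name", "gender", "frequency"])


# Demonstration with a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Mary,F,7065\nAnna,F,2604\n")
    demo = parse_dataset(1880, data_dir=tmp)

print(demo.shape)   # (2, 3)
```

With the default `data_dir`, a call such as `parse_dataset(2000)` would read `../data/yob2000.txt`, as in the cells above.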


In [94]:
# 1. Read in data:
df = pd.read_csv('../df_1880.csv', index_col=0)
df.head()



name,frequency,gender
Mary,7065,F
Anna,2604,F
Emma,2003,F
Elizabeth,1939,F
Minnie,1746,F
...,...,...
Woodie,5,M
Worthy,5,M
Wright,5,M
York,5,M


In [96]:

# 2. display the 'frequency' column

df["frequency"]

name
Mary         7065
Anna         2604
Emma         2003
Elizabeth    1939
Minnie       1746
             ... 
Woodie          5
Worthy          5
Wright          5
York            5
Zachariah       5
Name: frequency, Length: 2000, dtype: int64

In [95]:
# 3. display the 'gender' and 'frequency' columns

df[["frequency","gender"]]


name,frequency,gender
Mary,7065,F
Anna,2604,F
Emma,2003,F
Elizabeth,1939,F
Minnie,1746,F
...,...,...
Woodie,5,M
Worthy,5,M
Wright,5,M
York,5,M


In [97]:
# 4. display the data for row(s) containing William
df.loc["William"]

name,gender,frequency
William,F,30
William,M,9532


In [101]:
# 5. display the data for all rows with William, Paul, and Anne

df.loc[["William", "Paul", "Anne"]]

name,gender,frequency
William,F,30
William,M,9532
Paul,M,301
Anne,F,136


In [112]:
# 6. display the 'frequency' column for William, Paul, and Anne
df.loc[["William", "Paul", "Anne"],["frequency"]]

name,frequency
William,30
William,9532
Paul,301
Anne,136
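`.loc` selects by index label, and a label may occur more than once, as William does here. This sketch rebuilds the four rows from the outputs above by hand to show what each kind of lookup returns:

```python
import pandas as pd

# Rows taken from the outputs above, with name as the index.
df = pd.DataFrame(
    {"gender": ["F", "M", "M", "F"], "frequency": [30, 9532, 301, 136]},
    index=pd.Index(["William", "William", "Paul", "Anne"], name="name"),
)

# A label that occurs twice returns both rows as a DataFrame.
print(len(df.loc["William"]))            # 2

# A unique label returns a single row as a Series.
print(type(df.loc["Paul"]).__name__)     # Series

# Rows and columns can be selected together: rows first, then columns.
print(df.loc[["Paul", "Anne"], "frequency"].tolist())   # [301, 136]
```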


In [108]:
# 7. display the first three names and both columns

df.head(3)

name,gender,frequency
Mary,F,7065
Anna,F,2604
Emma,F,2003


In [111]:
# 8. display both columns for every second name
df.iloc[::2]

name,gender,frequency
Mary,F,7065
Emma,F,2003
Minnie,F,1746
Ida,F,1472
Bertha,F,1320
...,...,...
Unknown,M,5
Wes,M,5
Wood,M,5
Worthy,M,5
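`.iloc` selects by position and uses the same `start:stop:step` slicing as a Python list, while `.loc` selects by label. A short sketch on the first five names from the outputs above:

```python
import pandas as pd

# First five names from the outputs above, with name as the index.
df = pd.DataFrame(
    {"frequency": [7065, 2604, 2003, 1939, 1746]},
    index=pd.Index(["Mary", "Anna", "Emma", "Elizabeth", "Minnie"], name="name"),
)

# iloc slices by position; the stop position is excluded.
print(df.iloc[::2].index.tolist())    # ['Mary', 'Emma', 'Minnie']
print(df.iloc[:3].index.tolist())     # ['Mary', 'Anna', 'Emma']

# loc slices by label, and a label slice includes its end point.
print(df.loc["Mary":"Emma"].index.tolist())   # ['Mary', 'Anna', 'Emma']
```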


## License
(c) 2022 Samuel McGuire.
Distributed under the conditions of the MIT License.