## Lecture Housekeeping:

- Disrespectful language is prohibited in questions. This is a supportive learning environment for all, so please engage accordingly.
    - Please review the Code of Conduct (in the Student Undertaking Agreement) if unsure.
- No question is daft or silly - ask them!
- There are Q&A sessions midway and at the end of the session, should you wish to ask any follow-up questions.
- Should you have any questions after the lecture, please schedule a mentor session.
- For all non-academic questions, please submit a query: [www.hyperiondev.com/support](www.hyperiondev.com/support)

## Datasets and Dataframes

#### Learning objectives

   - Introduction to Datasets and Dataframes
   - Overview of structuring and organising data

### Working with datasets

In [3]:
''' 
Using the pandas module, we'll be able
    to open .txt and .csv files
'''

import pandas as pd

df = pd.read_csv('balance_missing.txt', sep=' ')
df.head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
0,12.2407984760474,14.891,3606,283.0,2.0,,11.0,Male,No,Yes,Caucasian
1,23.2833339223376,,6645,483.0,3.0,82.0,15.0,Female,,Yes,Asian
2,22.5304088790893,104.593,7075,,4.0,71.0,11.0,Male,No,No,Asian
3,27.652810710665,148.924,9504,681.0,3.0,36.0,11.0,,No,No,Asian
4,16.8939784904888,55.882,4897,357.0,2.0,68.0,16.0,Male,No,Yes,Caucasian


Here we imported data from a text file named 'balance_missing.txt'. It is usually most convenient to keep the notebook and your datasets in the same folder, so that the file name alone is enough to locate the data.

To view the first 5 rows of the dataset, we can use the .head() method. We usually use it to get a glimpse of what the dataset is like, such as the column names and the type of data in each column.
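As a minimal sketch of the same loading step, the example below reads a small space-separated sample held in memory instead of a file on disk. The `sample_text` contents and the three columns are made up for illustration and are not taken from the lecture file. It also shows that `head()` accepts an optional row count.

```python
import io

import pandas as pd

# Hypothetical space-separated sample standing in for a file on disk.
sample_text = (
    "Balance Income Age\n"
    "12.24 14.891 34\n"
    "23.28 106.025 82\n"
    "22.53 104.593 71\n"
)

# read_csv accepts a path or a file-like object; sep=' ' matches the lecture call.
sample_df = pd.read_csv(io.StringIO(sample_text), sep=" ")

# head(n) returns the first n rows; with no argument it returns the first 5.
first_two = sample_df.head(2)
print(first_two)
```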

In [4]:
''' 
We can also use .tail() to find the last
    observations in a dataset
'''

df.tail()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
395,8.76498389979819,12.096,4100,307.0,3.0,32.0,13.0,Male,No,Yes,Caucasian
396,9.94383770023455,13.364,3838,296.0,5.0,65.0,17.0,,No,No,African American
397,14.882078455256,57.872,4171,321.0,5.0,67.0,12.0,Female,No,Yes,Caucasian
398,12.0010707267157,37.728,2525,192.0,1.0,44.0,13.0,,No,Yes,Caucasian
399,10.1595983903564,18.701,5524,415.0,5.0,64.0,7.0,Female,No,No,


It's useful to understand the scale of the dataset. Using the .index attribute, we can find the range of row labels in our dataset.

In [5]:
'''
The index behaves like Python's range(), so the
    stop value is exclusive, meaning the rows
        end at 399
'''

df.index

RangeIndex(start=0, stop=400, step=1)
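A short sketch of how the index relates to the size of a dataset, assuming a hypothetical three-row frame (`toy_df` is an illustrative name, not part of the lecture data). Both `len()` on the index and the `.shape` attribute give the row count.

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration.
toy_df = pd.DataFrame({"Income": [14.891, 106.025, 104.593], "Age": [34, 82, 71]})

# The default index is a RangeIndex whose stop value is exclusive.
print(toy_df.index)

# Number of rows, taken from the index.
n_rows = len(toy_df.index)

# .shape is a (rows, columns) tuple.
n_rows_from_shape, n_cols = toy_df.shape
```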

If we would like to see the columns in the dataset, we can use the .columns attribute.

In [6]:
df.columns

Index(['Balance', 'Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education',
       'Gender', 'Student', 'Married', 'Ethnicity'],
      dtype='object')

`sort_values()` arranges observations in order. The method takes parameters such as the column name to sort by (`by`). By default the observations are sorted in ascending order. To display the data in descending order, set `ascending=False`.

In [27]:
df.sort_values(by='Income', ascending=False)

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
28,35.271011,186.634,13414,949,2,41,14,Female,No,Yes,African American
323,33.745580,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian
355,34.034656,180.682,11966,832,2,58,8,Female,No,Yes,African American
261,38.785123,180.379,9310,665,3,67,8,Female,Yes,Yes,Asian
275,30.212080,163.329,8732,636,3,50,14,Male,No,Yes,Caucasian
...,...,...,...,...,...,...,...,...,...,...,...
262,7.653979,10.588,4049,296,1,66,13,Female,No,Yes,Caucasian
235,7.503813,10.503,2923,232,3,25,18,Female,No,Yes,African American
199,6.687342,10.403,4159,310,3,43,7,Male,No,Yes,Asian
250,8.573448,10.363,2430,191,2,47,18,Female,No,Yes,Asian
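The sketch below applies the same idea to a hypothetical four-row frame (`toy_df` and its values are made up for illustration). It sorts by two columns at once and shows that `sort_values()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical frame with a tie in the Income column.
toy_df = pd.DataFrame(
    {"Income": [55.882, 14.891, 148.924, 14.891], "Age": [68, 34, 36, 25]}
)

# Sort by Income descending, breaking ties by Age ascending.
sorted_df = toy_df.sort_values(by=["Income", "Age"], ascending=[False, True])

# sort_values returns a new DataFrame; toy_df keeps its original row order.
print(sorted_df)
```

To keep the sorted order, assign the result to a variable as done here with `sorted_df`.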


We can also select a single column, which returns a Series.

In [13]:
df.Rating.head()

0    283
1    483
2    514
3    681
4    357
Name: Rating, dtype: int64
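Attribute access such as `df.Rating` only works when the column name is a valid Python identifier. The sketch below, on a hypothetical frame (the "Credit Limit" column is an assumption added to illustrate the point), compares it with bracket access, which works for any column name.

```python
import pandas as pd

# Hypothetical frame; "Credit Limit" contains a space on purpose.
toy_df = pd.DataFrame({"Rating": [283, 483, 514], "Credit Limit": [3606, 6645, 7075]})

rating_attr = toy_df.Rating      # attribute access
rating_key = toy_df["Rating"]    # bracket access, same Series

# Bracket access also works for names that are not valid identifiers.
limit = toy_df["Credit Limit"]

# A list of names returns a DataFrame instead of a Series.
subset = toy_df[["Rating"]]
```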

We can also slice the rows of the data (since it's essentially a 2D array).

In [11]:
df[50:60]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
50,10.1073556011089,36.362,5183,376.0,3.0,49.0,15.0,Male,No,Yes,African American
51,13.0107676261896,39.705,3969,301.0,2.0,27.0,20.0,Male,No,Yes,African American
52,11.924342231818,44.205,5441,394.0,1.0,32.0,12.0,Male,No,Yes,Caucasian
53,9.72819204081147,16.304,5466,413.0,4.0,66.0,10.0,Male,No,Yes,Asian
54,7.66566199430089,15.333,1499,138.0,2.0,47.0,9.0,Female,No,Yes,Asian
55,11.4543371959078,32.916,1786,154.0,2.0,60.0,8.0,Female,No,Yes,Asian
56,17.05369062475,57.1,4742,372.0,7.0,79.0,18.0,Female,No,Yes,Asian
57,18.1554884853513,76.273,4779,367.0,4.0,65.0,14.0,Female,No,Yes,Caucasian
58,9.18079694095874,10.354,3480,281.0,2.0,70.0,17.0,Male,No,Yes,Caucasian
59,16.4240947050359,,5294,390.0,4.0,81.0,17.0,Female,No,No,Caucasian


The .loc indexer accesses a group of rows and columns by label.

In [13]:
df.loc[50:60,['Income','Age']]

Unnamed: 0,Income,Age
50,36.362,49.0
51,39.705,27.0
52,44.205,32.0
53,16.304,66.0
54,15.333,47.0
55,32.916,60.0
56,57.1,79.0
57,76.273,65.0
58,10.354,70.0
59,,81.0
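Because `.loc` works on labels rather than positions, it helps to see it on a frame whose index is not the default 0, 1, 2, and so on. The sketch below assumes a hypothetical frame with string labels (`toy_df` and its index values are illustrative only). Note that a label slice with `.loc` includes both endpoints.

```python
import pandas as pd

# Hypothetical frame with string row labels.
toy_df = pd.DataFrame(
    {"Income": [14.891, 106.025, 104.593, 148.924], "Age": [34, 82, 71, 36]},
    index=["a", "b", "c", "d"],
)

# .loc slices by label and includes both the start and the end label.
by_label = toy_df.loc["b":"c", ["Income"]]
print(by_label)
```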


#### Selection by Position

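You can select a range of rows and specific columns by integer position with `.iloc`, as in the syntax below. `5:10` selects the rows at positions 5 up to, but not including, 10, and `[1,7]` selects the columns at positions 1 and 7. To give a range of observations, use a colon. To pick individual columns, list them separated by commas. In the example below we have selected columns 1 and 7, which are Income and Gender (positions are counted from 0).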

In [28]:
df.iloc[5:10,[1,7]]

Unnamed: 0,Income,Gender
5,80.18,Male
6,20.996,Female
7,71.408,Male
8,15.125,Female
9,71.061,Female
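A small sketch of positional selection on a hypothetical three-column frame (`toy_df` and its values are made up for illustration). It shows that `.iloc` counts from 0 and leaves out the end position of a slice.

```python
import pandas as pd

# Hypothetical frame; column positions are Balance=0, Income=1, Gender=2.
toy_df = pd.DataFrame(
    {
        "Balance": [12.24, 23.28, 22.53, 27.65],
        "Income": [14.891, 106.025, 104.593, 148.924],
        "Gender": ["Male", "Female", "Male", "Female"],
    }
)

# Rows at positions 1 and 2 (3 is excluded), columns at positions 0 and 2.
by_position = toy_df.iloc[1:3, [0, 2]]
print(by_position)
```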


We can also use a single column's values to select rows. In the example below, we want to find out whether there are any users above the age of 90.

In [18]:
df[df.Age > 90]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
209,28.14285,151.947,9156,642,2,91,11,Female,No,Yes,African American
323,33.74558,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian
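The filter above works because the comparison produces a boolean Series that is then used to pick rows. The sketch below, on a hypothetical four-row frame (`toy_df`, its values, and the income threshold of 160 are illustrative assumptions), shows that mask and how two conditions can be combined.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
toy_df = pd.DataFrame(
    {"Age": [91, 41, 98, 25], "Income": [151.947, 186.634, 182.728, 10.503]}
)

# The comparison yields a boolean Series, one True/False value per row.
mask = toy_df.Age > 90

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
old_and_high_income = toy_df[(toy_df.Age > 90) & (toy_df.Income > 160)]
print(old_and_high_income)
```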


In [32]:
df.Income

0       14.891
1      106.025
2      104.593
3      148.924
4       55.882
        ...   
395     12.096
396     13.364
397     57.872
398     37.728
399     18.701
Name: Income, Length: 400, dtype: float64