### Starting out with datasets

In [1]:
import pandas as pd

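Below we import data from a text file called `balance.txt`, which is found in the task folder. Make sure it is in the same directory as this notebook, because the relative file name passed to `read_csv()` is resolved against the notebook's working directory. The `sep=' '` argument tells pandas that the fields are separated by spaces rather than commas.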

In [2]:
df = pd.read_csv('balance.txt', sep=' ')
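
As a minimal sketch of how `sep=' '` behaves, the snippet below reads a small space-delimited text from memory instead of from disk. The rows and the name `sample` are made up for illustration and are not taken from `balance.txt`; it is also an assumption that values containing a space are wrapped in double quotes.

```python
import io

import pandas as pd

# Hypothetical stand-in for a space-delimited file (not the real balance.txt).
raw = io.StringIO(
    'Balance Income Ethnicity\n'
    '12.5 14.9 Caucasian\n'
    '10.1 17.3 "African American"\n'
)

# sep=' ' splits fields on single spaces; a double-quoted value keeps its inner space.
sample = pd.read_csv(raw, sep=' ')

print(sample.shape)
print(sample.columns.tolist())
```

If a file used tabs or commas instead, the same call would need `sep='\t'` or the default `sep=','`.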


Here is how to view the top rows of the frame. The `head()` method shows the first five observations by default. Use it to get a glimpse of the data, such as the column names and the type of data in each column.

In [3]:
df.head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
0,12.240798,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian
1,23.283334,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian
2,22.530409,104.593,7075,514,4,71,11,Male,No,No,Asian
3,27.652811,148.924,9504,681,3,36,11,Female,No,No,Asian
4,16.893978,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian


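`tail()` shows the last observations of the dataset. Passing a number, as in `tail(7)`, sets how many rows are returned; here it is the last seven.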

In [4]:
df.tail(7)

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
393,10.958612,17.316,1335,138,2,65,13,Male,No,No,African American
394,14.735482,49.794,5758,410,4,40,8,Male,No,No,Caucasian
395,8.764984,12.096,4100,307,3,32,13,Male,No,Yes,Caucasian
396,9.943838,13.364,3838,296,5,65,17,Male,No,No,African American
397,14.882078,57.872,4171,321,5,67,12,Female,No,Yes,Caucasian
398,12.001071,37.728,2525,192,1,44,13,Male,No,Yes,Caucasian
399,10.159598,18.701,5524,415,5,64,7,Female,No,No,Asian
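
Both `head()` and `tail()` accept the number of rows to return. The sketch below uses a made-up ten-row frame (`demo` is a hypothetical name, not part of the task data) to show the default and an explicit count.

```python
import pandas as pd

# Hypothetical ten-row frame used only to illustrate head() and tail().
demo = pd.DataFrame({'Age': range(20, 30)})

default_head = demo.head()   # no argument: the first 5 rows
first_three = demo.head(3)   # the first 3 rows
last_two = demo.tail(2)      # the last 2 rows

print(first_three.index.tolist())
print(last_two.index.tolist())
```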


To get the range of index labels of your dataset, use the syntax `dataset_name.index`. This tells you how to refer to your observations. The output below shows that the index starts at 0 and stops before 400, so the valid labels run from 0 to 399 and the dataset has 400 observations. You cannot look up an observation outside that range. For example, 450 is not a valid index for this dataset, and neither is 400.

In [5]:
df.index

RangeIndex(start=0, stop=400, step=1)
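
The exclusive `stop` value is easy to misread, so here is a small sketch on a made-up four-row frame (`demo` is hypothetical). It shows that the last valid label is one less than `stop` and that asking `.loc` for a label outside the range raises a `KeyError`.

```python
import pandas as pd

# Hypothetical four-row frame; its index is RangeIndex(start=0, stop=4, step=1).
demo = pd.DataFrame({'Age': [34, 82, 71, 36]})

n_rows = len(demo.index)      # 4 observations
last_label = demo.index[-1]   # 3, because stop=4 itself is excluded

# Looking up a label that is not in the index raises KeyError.
try:
    demo.loc[10]
    found = True
except KeyError:
    found = False

print(n_rows, last_label, found)
```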

The `columns` attribute lists the column names in the data frame. You will need these names when you are doing an analysis and writing reports based on the dataset.

In [6]:
df.columns

Index(['Balance', 'Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education',
       'Gender', 'Student', 'Married', 'Ethnicity'],
      dtype='object')
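
The column index can also be turned into a plain list or used to check whether a name exists before you refer to it. This is a short sketch on a made-up one-row frame; the name `Salary` is a hypothetical column that is deliberately absent.

```python
import pandas as pd

# Hypothetical frame with two of the column names seen above.
demo = pd.DataFrame({'Income': [14.891], 'Gender': ['Male']})

names = demo.columns.tolist()            # plain Python list of names
has_income = 'Income' in demo.columns    # membership test on the column index
has_salary = 'Salary' in demo.columns    # 'Salary' is not a column here

print(names, has_income, has_salary)
```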

`describe()` shows a quick statistical summary of your data. As you can see, by default statistics are only calculated for columns with numerical values.

In [7]:
df.describe()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,13.429175,45.218885,4735.6,354.94,2.9575,55.6675,13.45
std,5.669256,35.244273,2308.198848,154.724143,1.371275,17.249807,3.125207
min,3.749403,10.354,855.0,93.0,1.0,23.0,5.0
25%,9.891439,21.00725,3088.0,247.25,2.0,41.75,11.0
50%,11.779615,33.1155,4622.5,344.0,3.0,56.0,14.0
75%,15.236961,57.47075,5872.75,437.25,4.0,70.0,16.0
max,38.785123,186.634,13913.0,982.0,9.0,98.0,20.0
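
If you also want a summary of the text columns, `describe()` takes an `include` argument. The sketch below uses a made-up three-row frame (`demo` is hypothetical) to compare the default with `include='all'`, which adds rows such as `unique`, `top` and `freq` for non-numeric columns.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
demo = pd.DataFrame({
    'Income': [10.0, 20.0, 30.0],
    'Gender': ['Male', 'Female', 'Male'],
})

numeric_summary = demo.describe()            # numeric columns only
full_summary = demo.describe(include='all')  # every column

print(numeric_summary.columns.tolist())
print(full_summary.loc['top', 'Gender'])     # most frequent value in Gender
```

Cells that do not apply to a column, such as `mean` for `Gender`, are filled with `NaN` in the combined summary.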


`sort_values()` arranges observations in order of the values in one or more columns. Pass the column name with the `by` parameter. By default the observations are sorted in ascending order. To sort in descending order, set `ascending=False`.

In [8]:
df.sort_values(by='Income',ascending=False).head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
28,35.271011,186.634,13414,949,2,41,14,Female,No,Yes,African American
323,33.74558,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian
355,34.034656,180.682,11966,832,2,58,8,Female,No,Yes,African American
261,38.785123,180.379,9310,665,3,67,8,Female,Yes,Yes,Asian
275,30.21208,163.329,8732,636,3,50,14,Male,No,Yes,Caucasian
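
`sort_values()` can also sort by several columns at once, with a separate direction for each. A minimal sketch on made-up data (`demo` is hypothetical): rows are ordered by `Cards` ascending, and ties are broken by `Income` descending.

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
demo = pd.DataFrame({
    'Cards': [2, 3, 2, 3],
    'Income': [50.0, 20.0, 80.0, 60.0],
})

# One direction per column: Cards ascending, then Income descending.
ordered = demo.sort_values(by=['Cards', 'Income'], ascending=[True, False])

print(ordered.index.tolist())
```

Note that `sort_values()` returns a new frame and leaves `demo` unchanged, which is also why the original index labels travel with the rows in the sorted output.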


Selecting a single column yields a Series. Here the column is accessed as an attribute of the frame, `df.Rating`.

In [9]:
df.Rating.head(5)

0    283
1    483
2    514
3    681
4    357
Name: Rating, dtype: int64
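
Attribute access is a shortcut for the bracket form `df['Rating']`. The sketch below, on a made-up frame, shows that both return the same Series and that brackets are needed when a column name contains a space. The column `Credit Limit` is a hypothetical name chosen to show that case.

```python
import pandas as pd

# Hypothetical frame; 'Credit Limit' is an invented column name with a space.
demo = pd.DataFrame({'Rating': [283, 483], 'Credit Limit': [3606, 6645]})

by_attr = demo.Rating          # attribute access
by_key = demo['Rating']        # bracket access, same Series
limit = demo['Credit Limit']   # a name with a space only works with brackets
as_frame = demo[['Rating']]    # a list of names returns a DataFrame instead

print(type(by_key).__name__, type(as_frame).__name__)
```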

Selecting with `[ ]` and a slice selects rows by position. The end of the slice is excluded, so `df[50:60]` returns rows 50 to 59.

In [10]:
df[50:60]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
50,10.107356,36.362,5183,376,3,49,15,Male,No,Yes,African American
51,13.010768,39.705,3969,301,2,27,20,Male,No,Yes,African American
52,11.924342,44.205,5441,394,1,32,12,Male,No,Yes,Caucasian
53,9.728192,16.304,5466,413,4,66,10,Male,No,Yes,Asian
54,7.665662,15.333,1499,138,2,47,9,Female,No,Yes,Asian
55,11.454337,32.916,1786,154,2,60,8,Female,No,Yes,Asian
56,17.053691,57.1,4742,372,7,79,18,Female,No,Yes,Asian
57,18.155488,76.273,4779,367,4,65,14,Female,No,Yes,Caucasian
58,9.180797,10.354,3480,281,2,70,17,Male,No,Yes,Caucasian
59,16.424095,51.872,5294,390,4,81,17,Female,No,No,Caucasian


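Selecting with `.loc` slices the rows by index label. Unlike `[ ]`, a `.loc` slice includes its end label, so labels 40 to 49 give the ten rows below.

In [11]:
df.loc[40:49]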

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
40,12.029646,34.95,3327,253,3,54,14,Female,No,No,African American
41,25.291008,113.659,7659,538,2,66,15,Male,Yes,Yes,African American
42,13.123669,44.158,4763,351,2,66,13,Female,No,Yes,Asian
43,12.319976,36.929,6257,445,1,24,14,Female,No,Yes,Asian
44,12.059596,31.861,6375,469,3,25,16,Female,No,Yes,Caucasian
45,18.653661,77.38,7569,564,3,50,12,Female,No,Yes,Caucasian
46,10.805825,19.531,5043,376,2,64,16,Female,Yes,Yes,Asian
47,11.488565,44.646,4431,320,2,49,15,Male,Yes,Yes,Caucasian
48,13.433468,44.522,2252,205,6,72,15,Male,No,Yes,Asian
49,14.007633,43.479,4569,354,4,49,13,Male,Yes,Yes,African American
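
The three slicing forms treat the end of the slice differently, which is a common source of off-by-one mistakes. A minimal sketch on a made-up five-row frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical five-row frame with the default labels 0..4.
demo = pd.DataFrame({'Age': [54, 66, 66, 24, 25]})

by_label = demo.loc[1:3]      # label-based: includes the end label 3
by_position = demo.iloc[1:3]  # position-based: excludes position 3
plain = demo[1:3]             # plain [] slicing is also position-based

print(by_label.index.tolist())
print(by_position.index.tolist())
print(plain.index.tolist())
```

With the default index, labels and positions coincide, so the only visible difference is whether the last row is included.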


#### Selection by Position

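You can select a range of rows and a set of columns by position with `.iloc`, as in the syntax below. Rows go before the comma and columns after it. To give a range of observations, use a colon: `5:8` means rows 5 to 7, because the end of the slice is excluded. To pick individual columns, use a list: `[1,7]` means columns 1 and 7. Positions are counted from 0, so columns 1 and 7 are `Income` and `Gender`.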

In [12]:
df.iloc[5:8,[1,7]]


Unnamed: 0,Income,Gender
5,80.18,Male
6,20.996,Female
7,71.408,Male
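
The same rows and columns can be selected by name with `.loc` instead of by position. The sketch below uses a made-up four-row frame (`demo` is hypothetical) and shows two calls that return the same result. Note that the `.loc` row slice ends at 2 rather than 3, because label slices include their end.

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
demo = pd.DataFrame({
    'Balance': [1.0, 2.0, 3.0, 4.0],
    'Income': [10.0, 20.0, 30.0, 40.0],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
})

# Rows at positions 1 and 2, columns at positions 1 and 2.
by_position = demo.iloc[1:3, [1, 2]]

# The same selection by label: rows labelled 1 to 2, columns by name.
by_label = demo.loc[1:2, ['Income', 'Gender']]

print(by_position.columns.tolist())
```

Selecting by name tends to be more robust when the column order of the file might change.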


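You can use a single column’s values to select data. The comparison `df.Age > 90` produces a Series of `True` and `False` values, and indexing the frame with it keeps only the rows where the value is `True`. In the example below, we want to find whether there are any users above the age of 90.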

In [13]:
df[df.Age > 90]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
209,28.14285,151.947,9156,642,2,91,11,Female,No,Yes,African American
323,33.74558,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian
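
Conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. A minimal sketch on made-up data (`demo` is hypothetical, and the thresholds are chosen only for illustration):

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration.
demo = pd.DataFrame({
    'Age': [91, 45, 98, 30],
    'Student': ['No', 'Yes', 'No', 'Yes'],
})

# The comparison alone gives a boolean Series, one value per row.
mask = demo['Age'] > 90
over_90 = demo[mask]

# Parentheses are required around each condition when combining with & or |.
older_non_students = demo[(demo['Age'] > 40) & (demo['Student'] == 'No')]

print(mask.tolist())
print(older_non_students.index.tolist())
```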
