# Introduction to Python for Data Science

## 1. Loading Libraries

<div style="text-align: justify"> In programming, a library is a collection of functions or methods that perform different computational tasks. Commonly used functions and methods are provided by the library, which saves you from writing that code yourself.
Loading a library is easy: use the import...as statement, as shown below. </div>

In [1]:
import pandas as pd 

<div style = "text-align: justify"> The as keyword gives the library an alias, so instead of typing pandas every time you want to invoke a library method, you can use a shorter name, in this case pd. <br>

Additional reading: https://mode.com/python-tutorial/python-methods-functions-and-libraries/ </div>

## 2. DataFrames in pandas

<div style = "text-align: justify"> A DataFrame is a data structure. Much like stacks and queues, it has dedicated library support, because different libraries deal with different types of data structures. In this case it is the "pandas" library. <br>

Additional reading about pandas data structures: https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html </div>

<div style = "text-align: justify"> We can now create a new DataFrame and assign it to a variable. The variable acts as a reference to the DataFrame object. This might sound confusing, but it simply means that the variable name now points to your newly created DataFrame.</div>

In [2]:
dataFrameName = pd.DataFrame({
                'StudentID' : [264422,264423,264444,264445,264446],
                'FirstName' : ['Steven','Alex','Bill','Mark','Bob'],
                'EnrolYear' : [2010,2010,2011,2011,2013],
                'Math'      : [100,90,90,40,60],
                'English'   : [60,70,80,80,60]
                })

<div style = "text-align: justify"> You may notice that the argument pd.DataFrame takes is a dictionary in which each key maps to a list, and you're right. This is one of the formats accepted for creating a DataFrame. Each key becomes a column header, and its list supplies the values in that column, one per row. </div>
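
A minimal sketch of how the dictionary maps onto the table. The frame `scores` and its contents are made up purely for illustration; they are not part of the tutorial data.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
scores = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cat'],
    'Score': [71, 85, 90],
})

column_names = list(scores.columns)   # the dictionary keys become column headers
row_labels = list(scores.index)       # rows get a default integer index starting at 0
print(column_names)
print(row_labels)
print(scores.shape)                   # (number of rows, number of columns)
```

Each list must have the same length, since every list fills one column from top to bottom.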

In [3]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English
0,264422,Steven,2010,100,60
1,264423,Alex,2010,90,70
2,264444,Bill,2011,90,80
3,264445,Mark,2011,40,80
4,264446,Bob,2013,60,60


In [4]:
print(dataFrameName)

   StudentID FirstName  EnrolYear  Math  English
0     264422    Steven       2010   100       60
1     264423      Alex       2010    90       70
2     264444      Bill       2011    90       80
3     264445      Mark       2011    40       80
4     264446       Bob       2013    60       60


<div style = "text-align: justify">The difference between these two cells, with and without print(), is that when a cell ends with a bare DataFrame expression, Jupyter automatically renders it as a formatted table. Previously, you had to call display(dataFrameName) to get this rendering. Now it is done for you, so the table looks nice.<br>

Calling print() gives you the plain-text representation instead. <br>
    Further reading: https://stackoverflow.com/questions/26873127/show-dataframe-as-table-in-ipython-notebook
    </div>

### Practice 1: Create another data table called "df2" that contains the height of some of the students. 

This should be fairly easy to achieve; it is basically the same thing you did before.

In [5]:
df2 = pd.DataFrame({"Height": [160, 155, 175, 175], 
                    "Student": [264422, 264423, 264444, 264445]})

In [6]:
df2

Unnamed: 0,Height,Student
0,160,264422
1,155,264423
2,175,264444
3,175,264445


In [7]:
print(df2)

   Height  Student
0     160   264422
1     155   264423
2     175   264444
3     175   264445


## 2.1 Column and row selection

A column can be selected using the [] notation. Remember, column names are CASE-SENSITIVE. 

In [8]:
dataFrameName['FirstName']

0    Steven
1      Alex
2      Bill
3      Mark
4       Bob
Name: FirstName, dtype: object

In [9]:
type(dataFrameName['FirstName'])

pandas.core.series.Series

<div style = "text-align: justify"> This is the part where it gets slightly interesting. Remember how type() in Python tells you the type of an object? You can do the same with pandas objects, and you'll find that selecting a single column gives you a Series rather than a DataFrame. That explains why dataFrameName['FirstName'] does not get the same pretty rendering.<br>
    
To display the column as a table, wrap the column name in an extra pair of brackets. This passes a list of column names, so the result is a DataFrame rather than a Series. </div>

In [10]:
dataFrameName[['FirstName']]

Unnamed: 0,FirstName
0,Steven
1,Alex
2,Bill
3,Mark
4,Bob


In [11]:
type(dataFrameName[['FirstName']])

pandas.core.frame.DataFrame

You may also use dot notation. It always returns a Series; I could not find a dot-notation equivalent that returns a DataFrame. <br>

Further reading on why square bracket accessing is superior: https://stackoverflow.com/questions/41130255/what-is-the-difference-between-using-squared-brackets-or-dot-to-access-a-column

In [12]:
dataFrameName.FirstName

0    Steven
1      Alex
2      Bill
3      Mark
4       Bob
Name: FirstName, dtype: object
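
Here is a small sketch of where dot notation works and where it does not. The frame `df` and its column names are hypothetical, chosen so that one name contains a space and another clashes with an existing DataFrame method. Calling `.to_frame()` on the resulting Series is one way, as far as I know, to get a DataFrame back after dot access.

```python
import pandas as pd

# Hypothetical frame: one column name has a space, one shares its name with a DataFrame method.
df = pd.DataFrame({
    'First Name': ['Ann', 'Ben'],
    'count': [3, 7],
    'Age': [30, 41],
})

age_series = df.Age              # dot access works for simple names and returns a Series
age_frame = df.Age.to_frame()    # convert that Series into a one-column DataFrame

first_names = df['First Name']   # a name with a space can only be reached with brackets
count_attr = df.count            # this is the DataFrame.count method, not the column
count_column = df['count']       # square brackets always reach the column

print(type(age_series), type(age_frame))
print(callable(count_attr))
```

This is the main reason square brackets are the safer habit: they behave the same way for every column name.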

Remember that you may also store such results in variables and use them later on, as below.

In [13]:
columnFirstName = dataFrameName['FirstName']

In [14]:
columnFirstName

0    Steven
1      Alex
2      Bill
3      Mark
4       Bob
Name: FirstName, dtype: object

In [15]:
type(columnFirstName)

pandas.core.series.Series

## 2.1.1 Conditioning 
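We can also impose a condition and keep only the rows that satisfy it. Why do we need to use [] twice? The inner expression, dataFrameName['FirstName'] == 'Alex', selects a column and compares it element by element, producing a Series of True/False values. The outer [] then uses that Series as a mask and keeps only the rows where it is True. 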
<br>
Further reading: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/

In [16]:
addedCondition = dataFrameName[dataFrameName['FirstName'] == 'Alex']

In [17]:
addedCondition

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English
1,264423,Alex,2010,90,70
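
To make the two sets of brackets easier to see, the selection can be split into two steps. This is just a sketch on a cut-down copy of the student table; the names `students`, `is_alex` and `alex_rows` are my own.

```python
import pandas as pd

# Cut-down copy of the student table, for illustration only.
students = pd.DataFrame({
    'FirstName': ['Steven', 'Alex', 'Bill'],
    'Math': [100, 90, 90],
})

is_alex = students['FirstName'] == 'Alex'   # step 1: a boolean Series, one value per row
print(is_alex.tolist())

alex_rows = students[is_alex]               # step 2: keep only the rows where the mask is True
print(alex_rows)
```

Writing both steps on one line gives exactly the nested-bracket form used in the cell above.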


## 2.2 Modifying the structure of a table

Creating a new column through square bracket notation 

In [18]:
dataFrameName['NewColumn'] = None

In [19]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn
0,264422,Steven,2010,100,60,
1,264423,Alex,2010,90,70,
2,264444,Bill,2011,90,80,
3,264445,Mark,2011,40,80,
4,264446,Bob,2013,60,60,


<div style = "text-align: justify"> As you can see, assigning to a column name that does not yet exist, using square bracket notation, creates a new column for you. You just have to specify what to put in each row of the newly created column. In this case it is None. 
    <br>
    
If you have a specific value for each row, do the below instead. </div> 

In [20]:
dataFrameName['NewColumn'] = [1,2,3,4,5]

In [21]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn
0,264422,Steven,2010,100,60,1
1,264423,Alex,2010,90,70,2
2,264444,Bill,2011,90,80,3
3,264445,Mark,2011,40,80,4
4,264446,Bob,2013,60,60,5


If you want the row-by-row total of two columns of the same data type, do the following instead.

In [22]:
dataFrameName['NewColumn'] = dataFrameName['Math'] + dataFrameName['English']

In [23]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn
0,264422,Steven,2010,100,60,160
1,264423,Alex,2010,90,70,160
2,264444,Bill,2011,90,80,170
3,264445,Mark,2011,40,80,120
4,264446,Bob,2013,60,60,120


<div style = "text-align: justify">What happens if you add columns of different data types? Remember that in Python you have to convert an int to a str before concatenating. But as you can see below, calling str() on a column doesn't work the way we intended: str() converts the whole Series, index and all, into one long string, and that string is then appended to every name. Afterwards, let's reset NewColumn so it is again the result of Math + English.</div>

In [24]:
dataFrameName['NewColumn'] = dataFrameName['FirstName'] + str(dataFrameName['English'])

In [25]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn
0,264422,Steven,2010,100,60,Steven0 60\n1 70\n2 80\n3 80\n4 ...
1,264423,Alex,2010,90,70,Alex0 60\n1 70\n2 80\n3 80\n4 6...
2,264444,Bill,2011,90,80,Bill0 60\n1 70\n2 80\n3 80\n4 6...
3,264445,Mark,2011,40,80,Mark0 60\n1 70\n2 80\n3 80\n4 6...
4,264446,Bob,2013,60,60,Bob0 60\n1 70\n2 80\n3 80\n4 60...
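
For comparison, here is a sketch of the element-wise conversion that the attempt above was aiming for. It uses `.astype(str)`, which converts each value separately. The small frame and the column name `NameAndMark` are only for illustration.

```python
import pandas as pd

# Two rows copied from the student table, for illustration only.
df = pd.DataFrame({
    'FirstName': ['Steven', 'Alex'],
    'English': [60, 70],
})

# str(series) would turn the entire Series into one long string;
# .astype(str) converts every value on its own, so the addition stays row by row.
df['NameAndMark'] = df['FirstName'] + ' ' + df['English'].astype(str)
print(df['NameAndMark'].tolist())
```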


In [26]:
dataFrameName['NewColumn'] = dataFrameName['Math'] + dataFrameName['English']

In [27]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn
0,264422,Steven,2010,100,60,160
1,264423,Alex,2010,90,70,160
2,264444,Bill,2011,90,80,170
3,264445,Mark,2011,40,80,120
4,264446,Bob,2013,60,60,120


### Practice 3: Add a new column showing the average mark of "Math" and "English" for each student. (Average = Total/number of subjects)

In [28]:
dataFrameName['Average'] = dataFrameName['NewColumn'] / 2

In [29]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


### Practice 4: Write a filter to select students who have scored more than 150 in total and who have achieved a subject score (in Maths or English) of 90 or more. (Hint: careful with the parentheses)
<br>
As a general rule of thumb, place parentheses around each and every condition below. Otherwise your code will fail with this error: 
<br>

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [30]:
# Named score_filter so that Python's built-in filter() is not shadowed
score_filter = (dataFrameName['NewColumn'] > 150) & ((dataFrameName['Math'] >= 90) | (dataFrameName['English'] >= 90))

In [31]:
dataFrameName[score_filter]

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
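
The sketch below shows why the parentheses matter, on a made-up two-row frame called `marks`. In Python, `&` binds more tightly than `>` and `>=`, so without parentheses the expression is grouped in a way we did not intend and pandas ends up being asked for the truth value of a whole Series. In the pandas versions I am aware of, that raises a `ValueError`.

```python
import pandas as pd

# Made-up marks, for illustration only.
marks = pd.DataFrame({'Math': [100, 40], 'English': [60, 80]})
marks['Total'] = marks['Math'] + marks['English']

try:
    # Parsed as: marks['Total'] > (150 & marks['Math']) >= 90
    bad = marks[marks['Total'] > 150 & marks['Math'] >= 90]
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# With parentheses, each comparison is evaluated first, then combined with &.
good = marks[(marks['Total'] > 150) & (marks['Math'] >= 90)]
print(good)
```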


### Melt function, turning column names into row values 

<div style = "text-align: justify"> Below is the full signature of pandas.melt(); only a few of its parameters are introduced in this unit. In order to "melt" two column names into values: 
    
<br>id_vars (identifier variables) is the list of columns that you want to keep as they are. 

<br>
value_vars is the list of column names that you want to turn into values. 
    
<br>var_name is the name of the new column that holds those former column names. 
    
<br>value_name is the name of the new column that holds the values from the original columns.
    
<br> 
You may ignore col_level and ignore_index for now.</div>
    
```Python
pandas.melt(frame, 
            id_vars=None, 
            value_vars=None, 
            var_name=None, 
            value_name='value', 
            col_level=None, 
            ignore_index=True)
```

Further reading: https://pandas.pydata.org/docs/reference/api/pandas.melt.html#pandas.melt

<div style = "text-align: justify"> Let's say I want to melt the Math and English column names into values, name the resulting column "Subject", and name the column holding the original English and Math values "Marks". </div>

In [32]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


In [33]:
meltedFrame = pd.melt(dataFrameName, 
                      ['StudentID', 'FirstName', 'EnrolYear'], 
                      ['Math','English'],
                      'Subject',
                      'Marks')

In [34]:
meltedFrame

Unnamed: 0,StudentID,FirstName,EnrolYear,Subject,Marks
0,264422,Steven,2010,Math,100
1,264423,Alex,2010,Math,90
2,264444,Bill,2011,Math,90
3,264445,Mark,2011,Math,40
4,264446,Bob,2013,Math,60
5,264422,Steven,2010,English,60
6,264423,Alex,2010,English,70
7,264444,Bill,2011,English,80
8,264445,Mark,2011,English,80
9,264446,Bob,2013,English,60
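
The call above passes its arguments by position. The same kind of call written with keyword arguments may be easier to read, because each parameter is named. This is a sketch on a made-up two-student frame; `wide` and `long` are my own variable names.

```python
import pandas as pd

# Made-up wide table: one row per student, one column per subject.
wide = pd.DataFrame({
    'StudentID': [1, 2],
    'Math': [100, 90],
    'English': [60, 70],
})

long = pd.melt(
    wide,
    id_vars=['StudentID'],            # columns kept as they are
    value_vars=['Math', 'English'],   # column names that become values
    var_name='Subject',               # new column holding the former column names
    value_name='Marks',               # new column holding the former cell values
)
print(long)
```

Each student now appears once per subject, so two rows and two subjects give four rows.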


<div style = "text-align: justify">If you refer to the tutorial lab sheet, you'll notice that they didn't pass the last parameter, value_name, like I did. You can pass this extra parameter so that the column is named 'Marks' instead of the default 'value'. The lab sheet then corrects the column name by introducing how to rename.</div>

### Renaming column names
<div style = "text-align: justify"> The documentation for .rename() is rather extensive, so to keep it simple, here is the full list of parameters below. </div>
<br>
However, we need to focus on the parameter inplace. inplace defaults to False, meaning that when you rename a column, the function returns a new DataFrame and leaves the original unchanged. If you set it to True, the function alters the original DataFrame directly and returns None. 
<br>
    
```Python
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
```
Further reading: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [35]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,NewColumn,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


In [36]:
dataFrameName.rename(columns = {'NewColumn' : "Total"}, inplace = True)

In [37]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,Total,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


What happens if I set inplace to False? Since the function returns a new DataFrame in that case, I can store it in a variable; here I have stored it in copyDataFrame. I can then compare copyDataFrame with dataFrameName to see if there's a difference. I will be renaming the 'Average' column, as can be seen in my columns parameter. 

In [38]:
copyDataFrame = dataFrameName.rename(columns = {'Average' : "Median"}, inplace = False)

In [39]:
copyDataFrame

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,Total,Median
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


In [40]:
dataFrameName

Unnamed: 0,StudentID,FirstName,EnrolYear,Math,English,Total,Average
0,264422,Steven,2010,100,60,160,80.0
1,264423,Alex,2010,90,70,160,80.0
2,264444,Bill,2011,90,80,170,85.0
3,264445,Mark,2011,40,80,120,60.0
4,264446,Bob,2013,60,60,120,60.0


And indeed, there is a difference: 'Average' has not been renamed in the original dataFrameName.
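
A compact way to check both behaviours of inplace side by side is to look at what .rename() returns. This is a sketch on a made-up frame; the new column names 'Mean' and 'Sum' are arbitrary.

```python
import pandas as pd

# Made-up frame, for illustration only.
df = pd.DataFrame({'Total': [160, 170], 'Average': [80.0, 85.0]})

renamed = df.rename(columns={'Average': 'Mean'})   # inplace defaults to False
print(list(df.columns))        # the original still has 'Average'
print(list(renamed.columns))   # the returned copy has 'Mean'

result = df.rename(columns={'Total': 'Sum'}, inplace=True)
print(result)                  # None, because df itself was modified
print(list(df.columns))
```

Because the in-place call returns None, writing `df = df.rename(..., inplace=True)` would throw the DataFrame away.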

## 2.3 Merging DataFrames
<div style = "text-align: justify">What happens when you want to join two different DataFrame objects? Let's start by making some examples. Make sure that both of your DataFrames have at least one column in common. If they share more than one column, you can specify which column to merge on with the 'on' parameter.
<br>

Further reading:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html</div>

In [41]:
dataFrame1 = pd.DataFrame({"Year":[2010, 2011, 2012], "Car Crashes" : [100, 189, 400]})

In [42]:
dataFrame2 = pd.DataFrame({"Year" : [2010, 2011, 2012], "Drink-drive cases" : [200, 300, 600]})

In [43]:
dataFrame1

Unnamed: 0,Year,Car Crashes
0,2010,100
1,2011,189
2,2012,400


In [44]:
dataFrame2

Unnamed: 0,Year,Drink-drive cases
0,2010,200
1,2011,300
2,2012,600


In [45]:
merged = pd.merge(dataFrame1, dataFrame2, on=['Year'])

In [46]:
merged

Unnamed: 0,Year,Car Crashes,Drink-drive cases
0,2010,100,200
1,2011,189,300
2,2012,400,600
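
In the example above every year appears in both tables. When the key values only partly overlap, the `how` parameter of pd.merge decides which rows survive. The sketch below shifts the second table by a year to show this; the 2013 row and its value are invented for illustration.

```python
import pandas as pd

crashes = pd.DataFrame({'Year': [2010, 2011, 2012], 'Car Crashes': [100, 189, 400]})
# The 2013 row is invented so that the two tables only partly overlap.
cases = pd.DataFrame({'Year': [2011, 2012, 2013], 'Drink-drive cases': [300, 600, 650]})

inner = pd.merge(crashes, cases, on='Year')               # default how='inner': shared years only
outer = pd.merge(crashes, cases, on='Year', how='outer')  # every year from either table
print(inner)
print(outer)
```

In the outer result, years missing from one table get NaN in that table's columns.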


## 3. Reading CSV and Excel Files into Data Frames

<div style = "text-align: justify"> What are CSV (Comma Separated Values) files?</div>
    
<br>
<div style = "text-align: justify">
CSV files are literally what they sound like. 
Each value in a row is separated by a comma, so a line like "Col1", "Col2" represents two columns. You can even open CSV files in Excel. We can read a CSV file by calling pd.read_csv(). </div>
    
<br>
<div style = "text-align: justify">
If your file is in another folder, you have to give the path to it. A relative path is resolved from the current working directory, which for a notebook is normally the folder containing the notebook itself. If uforeports.csv cannot be found at the path you give, an error will occur. Think of it like steps: the folder holding this notebook contains another folder called 'Week 2 Tutorial Files-20210312', so I need to go into that folder first before reaching 'uforeports.csv'. It is important to tell pandas exactly where to find the file.
</div>

In [47]:
alien_report = pd.read_csv('Week 2 Tutorial Files-20210312/uforeports.csv')

In [48]:
alien_report

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-01-06 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-01-06 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00
...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,2000-12-31 23:00:00
18237,Spirit Lake,,DISK,IA,2000-12-31 23:00:00
18238,Eagle River,,,WI,2000-12-31 23:45:00
18239,Eagle River,RED,LIGHT,WI,2000-12-31 23:45:00


In [49]:
type(alien_report)

pandas.core.frame.DataFrame

As you can see, reading a CSV file with pandas produces a DataFrame. 
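
If you do not have the tutorial files at hand, you can still try pd.read_csv by feeding it CSV text held in memory. The sketch below uses `io.StringIO` from the standard library as a stand-in for a file; the two data rows are made up.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = (
    "City,State,Time\n"
    "Springfield,NY,1950-03-01 21:00:00\n"
    "Riverton,CO,1951-07-15 14:30:00\n"
)

reports = pd.read_csv(io.StringIO(csv_text))   # the first line becomes the column headers
print(type(reports))
print(reports.shape)
print(reports.dtypes)                          # Time is read as plain text here
```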

### Displaying sections of large DataFrame objects

Using the method .head() will by default display the first 5 lines of a dataframe object. 

In [50]:
alien_report.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-01-06 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-01-06 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


However, you may also specify how many rows from the top .head() should display by passing in an integer, for example 10 if you want the first 10 rows of the DataFrame. 

In [51]:
alien_report.head(10)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-01-06 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-01-06 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00
5,Valley City,,DISK,ND,1934-09-15 15:30:00
6,Crater Lake,,CIRCLE,CA,1935-06-15 00:00:00
7,Alma,,DISK,MI,1936-07-15 00:00:00
8,Eklutna,,CIGAR,AK,1936-10-15 17:00:00
9,Hubbard,,CYLINDER,OR,1937-06-15 00:00:00


### Practice 9: How can we display the last 5 records? How can we display the last record only?

Another method, .tail(), displays 5 rows from the bottom instead. 

In [52]:
alien_report.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,2000-12-31 23:00:00
18237,Spirit Lake,,DISK,IA,2000-12-31 23:00:00
18238,Eagle River,,,WI,2000-12-31 23:45:00
18239,Eagle River,RED,LIGHT,WI,2000-12-31 23:45:00
18240,Ybor,,OVAL,FL,2000-12-31 23:59:00


To obtain the last record, simply pass in 1 to show 1 line from the bottom.

In [53]:
alien_report.tail(1)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18240,Ybor,,OVAL,FL,2000-12-31 23:59:00
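
An alternative for the last record is positional indexing with `.iloc[-1]`. The difference is in what comes back: `.tail(1)` gives a one-row DataFrame, while `.iloc[-1]` gives the row as a Series. A small sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame, for illustration only.
df = pd.DataFrame({'City': ['A', 'B', 'C'], 'State': ['NY', 'NJ', 'CO']})

last_as_frame = df.tail(1)     # a DataFrame with a single row
last_as_series = df.iloc[-1]   # the same row as a Series, indexed by column name

print(type(last_as_frame))
print(type(last_as_series))
print(last_as_series['City'])
```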


### Displaying Excel files 

Excel files are a bit more troublesome because we have to specify the sheet name as well. If you recall, an Excel file can contain multiple worksheets. The extension for an Excel file is .xls (or .xlsx for newer files). 
<br>
Use the pd.read_excel() function instead, and pass in two arguments: the path to your file, and the sheet_name of the worksheet you want to read.

In [54]:
alien_report_xls = pd.read_excel('Week 2 Tutorial Files-20210312/Uforeports_excel.xls', sheet_name = 'uforeports')

In [55]:
alien_report_xls

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,Unnamed: 5,Unnamed: 6
0,Ithaca,,TRIANGLE,NY,1930-06-01,10:00:00,PM
1,Willingboro,,OTHER,NJ,1930-06-30,20:00:00,
2,Holyoke,,OVAL,CO,1931-02-15,14:00:00,
3,Abilene,,DISK,KS,1931-06-01,01:00:00,PM
4,New York Worlds Fair,,LIGHT,NY,1933-04-18,19:00:00,
...,...,...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,2000-12-31,23:00:00,
18237,Spirit Lake,,DISK,IA,2000-12-31,23:00:00,
18238,Eagle River,,,WI,2000-12-31,23:45:00,
18239,Eagle River,RED,LIGHT,WI,2000-12-31,23:45:00,


But hey, the Time column is displayed in a different format in each DataFrame. This is because the type of the Time column differs between the two file formats. The PDF file for this section explains it very well. We just have to convert the form we don't want into the other form. We can use the following to find out what type of values the column contains. 

In [60]:
print(alien_report.Time.dtypes)

object


In [61]:
print(alien_report_xls.Time.dtypes)


datetime64[ns]


As you can see, the Time values read from the CSV file have dtype object, whereas those read from the xls file have a datetime dtype. This is because CSV content is literally just text, whereas an Excel file stores the cell type, so pandas knows the column holds dates and times. Just keep in mind that dtypes are the column data types used by pandas, and that object is the catch-all dtype for arbitrary Python objects, most commonly strings. Read the link and you'll get it. Convert an object column to the datetime dtype using pd.to_datetime().

Further reading: https://pbpython.com/pandas_dtypes.html

In [66]:
alien_report.Time = pd.to_datetime(alien_report.Time)

In [67]:
alien_report.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,1930-01-06 22:00:00
1,Willingboro,,OTHER,NJ,1930-06-30 20:00:00
2,Holyoke,,OVAL,CO,1931-02-15 14:00:00
3,Abilene,,DISK,KS,1931-01-06 13:00:00
4,New York Worlds Fair,,LIGHT,NY,1933-04-18 19:00:00


In [68]:
print(alien_report.Time.dtype)

datetime64[ns]
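
Once a column has a datetime dtype, its parts can be read through the `.dt` accessor. The sketch below also passes `errors='coerce'`, which as far as I know turns text that cannot be parsed into NaT (pandas' missing-time marker) instead of raising an error. The three strings are made up.

```python
import pandas as pd

# Made-up values: two valid timestamps and one piece of junk text.
raw = pd.Series(['1930-06-01 22:00:00', '1931-02-15 14:00:00', 'not a date'])

times = pd.to_datetime(raw, errors='coerce')   # unparseable text becomes NaT
print(times.isna().tolist())                   # which entries failed to parse
print(times.dt.year.iloc[0])                   # .dt exposes year, month, hour and so on
```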


## 4 Basic data auditing

### Practice 10: Write a statement to read in titanic data (titanic.csv) into a pandas DataFrame called ‘titanic’. Then print out the first few rows.
This should be pretty simple: just call pd.read_csv() to read the file into a DataFrame object.

In [69]:
titanicCSV = pd.read_csv('Week 2 Tutorial Files-20210312/titanic.csv')

In [70]:
titanicCSV

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


A pretty handy attribute to know if you just want to see how many rows and columns a DataFrame has. .shape is not a function; it is a property of the object, so it is accessed without parentheses. Technically, though, the rendered cell output already shows the row and column counts...

In [72]:
print(titanicCSV.shape)

(891, 15)


Remember that we can also inspect the dtype of each column by accessing .dtype on the columns one by one. 

In [77]:
print(titanicCSV.survived.dtype)
print(titanicCSV.pclass.dtype)
print(titanicCSV.sex.dtype)
print(titanicCSV.age.dtype)
print(titanicCSV.sibsp.dtype)
print(titanicCSV.parch.dtype)

int64
int64
object
float64
int64
int64
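
Instead of printing one column at a time, the `.dtypes` attribute of the DataFrame lists the dtype of every column in one go, and the `.shape` tuple can be unpacked into two variables. A sketch on a made-up frame that borrows three of the titanic column names:

```python
import pandas as pd

# Made-up frame borrowing three titanic column names, for illustration only.
df = pd.DataFrame({
    'survived': [0, 1],
    'age': [22.0, 38.0],
    'alone': [False, True],
})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(df.dtypes)            # one dtype per column, returned as a Series
```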


This is the end of this tutorial sheet. I will not go on to segregation, because it just introduces a bunch of different methods for finding certain things. The one exception is .groupby(), shown below.

Further reading: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

In [86]:
sex_class = titanicCSV.groupby(['class','sex'])['age']

In [87]:
sex_class

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002846C112820>

In [88]:
sex_class.mean()

class   sex   
First   female    34.611765
        male      41.281386
Second  female    28.722973
        male      30.740707
Third   female    21.750000
        male      26.507589
Name: age, dtype: float64

In [89]:
type(sex_class)

pandas.core.groupby.generic.SeriesGroupBy
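
As the cells above show, a groupby object on its own computes nothing until an aggregation such as .mean() is applied. Several aggregations can be requested at once with `.agg`. This sketch uses a tiny made-up frame with the same column names as the titanic data, not the real file.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the titanic data.
people = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'sex': ['female', 'male', 'female', 'female'],
    'age': [30.0, 40.0, 20.0, 24.0],
})

grouped = people.groupby(['class', 'sex'])['age']   # lazy: nothing is computed yet
summary = grouped.agg(['mean', 'count'])            # one row per (class, sex) group
print(summary)
```

Only the combinations that actually occur in the data show up, so this frame gives three groups rather than four.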