# Creating, Reading and Writing

Importing pandas

In [2]:
import pandas as pd

## Creating data

There are two core objects in pandas: 
- the DataFrame 
- and the Series.

### 1. DataFrame

In [3]:
pd.DataFrame({'Yes':[50,20], 'No':[21,11]})

,Yes,No
0,50,21
1,20,11


DataFrame entries are not limited to integers. 

In [4]:
pd.DataFrame({'Harry':['I am Harry.', 'I am nine.'],'Eline':['I am Eline.', 'I am eight.']})

,Harry,Eline
0,I am Harry.,I am Eline.
1,I am nine.,I am eight.


The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.<br>
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

In [5]:
pd.DataFrame({'Harry':['I am Harry.', 'I am nine.'],'Eline':['I am Eline.', 'I am eight.']}, index=['Name', 'Age'])

,Harry,Eline
Name,I am Harry.,I am Eline.
Age,I am nine.,I am eight.
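
As a minimal sketch of where the labels end up, the same toy data can be rebuilt and its row and column labels inspected. The variable names `df`, `row_labels` and `col_labels` are only illustrative.

```python
import pandas as pd

# Same toy data as above, with custom row labels supplied via `index`.
df = pd.DataFrame(
    {'Harry': ['I am Harry.', 'I am nine.'],
     'Eline': ['I am Eline.', 'I am eight.']},
    index=['Name', 'Age'],
)

row_labels = list(df.index)    # the Index holds the row labels
col_labels = list(df.columns)  # the column labels come from the dictionary keys

print(row_labels)
print(col_labels)
```

The row labels come from the `index` argument, while the column labels still come from the dictionary keys.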


### 2. Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

In [6]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame, so you can assign row labels to it the same way as before, using an index parameter. However, a Series does not have a column name; it has only one overall name:

In [7]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
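
To illustrate the "single column" relationship, the sketch below builds a small DataFrame and pulls one column out of it. The `Product B` values are made up for the example, and `sales` and `product_a` are illustrative names.

```python
import pandas as pd

# 'Product A' reuses the values from above; 'Product B' is invented
# purely so that the frame has more than one column.
sales = pd.DataFrame(
    {'Product A': [30, 35, 40], 'Product B': [21, 34, 31]},
    index=['2015 Sales', '2016 Sales', '2017 Sales'],
)

# Selecting one column returns a Series that keeps the row labels
# and takes the column label as its name.
product_a = sales['Product A']

print(type(product_a).__name__)
print(product_a.name)
```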

## Reading data files

* Reading a CSV file:

In [8]:
titanic = pd.read_csv("../DataSets/titanic.csv")

In [9]:
health = pd.read_csv("../DataSets/Health_heart_experimental.csv")

We can use the shape attribute to check how large the resulting DataFrame is:

In [10]:
titanic.shape  # (rows, columns)

(891, 12)
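
A column named `Unnamed: 0`, like the one in the health data used later, typically appears when a CSV was saved together with its index and then read back with default settings. The sketch below reproduces this with a small in-memory CSV and shows how the `index_col` parameter of `read_csv` avoids the extra column. The text in `csv_text` is a stand-in for a file on disk, not the actual dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text; the blank first header cell is what a
# previously saved index looks like.
csv_text = ",age,BMI\n0,64,32.0\n1,21,21.0\n2,30,21.0\n"

# Default read: the saved index comes back as an ordinary column.
plain = pd.read_csv(io.StringIO(csv_text))

# Telling pandas that column 0 is the index keeps it out of the data.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.shape, list(plain.columns))
print(indexed.shape, list(indexed.columns))
```

Whether this applies to a given file depends on how that file was written, so it is worth checking `shape` and the column names after loading.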

We can examine the first five rows with the head() method. Note the parentheses: head is a method, so it has to be called.

In [11]:
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S

---
# Indexing, Selecting & Assigning

## Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

In [12]:
health

,Unnamed: 0,age,sex,SysBP,DiaBP,HR,weightKg,heightCm,BMI,indication
0,0,64,1,141,96,128,69,147,32.0,1
1,1,21,1,109,100,106,48,150,21.0,0
2,2,30,0,112,73,126,69,183,21.0,0
3,3,35,1,106,90,130,45,149,20.0,0
4,4,39,0,140,90,112,92,166,33.0,1
...,...,...,...,...,...,...,...,...,...,...
71755,71755,50,1,133,88,116,53,154,22.0,0
71756,71756,69,1,123,112,94,70,142,35.0,1
71757,71757,24,1,143,107,107,57,156,23.0,0
71758,71758,42,1,122,111,150,115,176,37.0,1


In Python, we can access a property of an object as an attribute. A DataFrame column can be accessed the same way:

In [13]:
health.age

0        64
1        21
2        30
3        35
4        39
         ..
71755    50
71756    69
71757    24
71758    42
71759    49
Name: age, Length: 71760, dtype: int64

As with a Python dictionary, we can also access a column's values using the indexing ([]) operator:

In [14]:
health['BMI']

0        32.0
1        21.0
2        21.0
3        20.0
4        33.0
         ... 
71755    22.0
71756    35.0
71757    23.0
71758    37.0
71759    36.0
Name: BMI, Length: 71760, dtype: float64

The key advantage of the indexing operator is that it can handle column names that contain whitespace or other characters not allowed in a Python identifier.
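
A short sketch of the difference, using a hypothetical column name `heart rate` that contains a space (the health data itself uses `HR`):

```python
import pandas as pd

# 'heart rate' is a made-up column name chosen because it contains a space.
df = pd.DataFrame({'age': [64, 21], 'heart rate': [128, 106]})

by_attr = df.age            # works: 'age' is a valid Python identifier
by_key = df['heart rate']   # works: any column label can go inside the brackets
# df.heart rate             # would be a SyntaxError, so it is left commented out

print(by_key.tolist())
```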

To pick out a single value, index the resulting Series again:

In [15]:
health['BMI'][0]

np.float64(32.0)

## Indexing in pandas

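Pandas has two accessor operators of its own: loc and iloc.

#### Index-based selection

The first paradigm is index-based selection with iloc, which selects data by its numerical position. To select the first row of the DataFrame: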

In [16]:
health.iloc[0]

Unnamed: 0      0.0
age            64.0
sex             1.0
SysBP         141.0
DiaBP          96.0
HR            128.0
weightKg       69.0
heightCm      147.0
BMI            32.0
indication      1.0
Name: 0, dtype: float64

Both loc and iloc are row-first, column-second. This is the opposite of the native-style access used above, such as health['BMI'][0], which is column-first, row-second.<br>
To get a column with iloc, we can do the following:

In [17]:
health.iloc[:,0]

0            0
1            1
2            2
3            3
4            4
         ...  
71755    71755
71756    71756
71757    71757
71758    71758
71759    71759
Name: Unnamed: 0, Length: 71760, dtype: int64
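
The row-first, column-second order can be checked on a tiny frame. The sketch below uses three rows copied from the health output above; `df`, `first_row`, `first_col` and `cell` are illustrative names.

```python
import pandas as pd

# Three rows of age, sex and BMI taken from the health table shown earlier.
df = pd.DataFrame({
    'age': [64, 21, 30],
    'sex': [1, 1, 0],
    'BMI': [32.0, 21.0, 21.0],
})

first_row = df.iloc[0]      # row at position 0, returned as a Series
first_col = df.iloc[:, 0]   # every row, column at position 0
cell = df.iloc[1, 2]        # row position 1, then column position 2

print(cell)
```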

On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the first column (PassengerId) from just the first, second, and third rows, we would do:

In [18]:
titanic.iloc[:3,0]

0    1
1    2
2    3
Name: PassengerId, dtype: int64

Or, to select just the second and third entries of the column at position 4 (Sex), we would do:

In [24]:
titanic.iloc[1:3,4]

1    female
2    female
Name: Sex, dtype: object

It's also possible to pass a list:

In [23]:
titanic.iloc[[3,4,5],3]

3    Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                        Allen, Mr. William Henry
5                                Moran, Mr. James
Name: Name, dtype: object

Finally, it's worth knowing that negative numbers can be used in selection. They count backwards from the end of the values, so this selects the last five rows:

In [21]:
titanic.iloc[-5:]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
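
The positional selectors above (ranges, lists, and negative numbers) can be compared side by side on a small stand-in frame. This is only a sketch: the six-row `df` is made up and is not the Titanic data.

```python
import pandas as pd

# A six-row stand-in with a single column.
df = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5, 6]})

first_three = df.iloc[:3, 0].tolist()     # positions 0, 1, 2
second_third = df.iloc[1:3, 0].tolist()   # positions 1 and 2; the stop (3) is excluded
picked = df.iloc[[3, 4, 5], 0].tolist()   # an explicit list of positions
last_two = df.iloc[-2:, 0].tolist()       # negative positions count from the end

print(first_three, second_third, picked, last_two)
```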


#### Label-based selection

The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it's the data index value, not its position, which matters. For example, to get the Name of the row labeled 0:

In [26]:
titanic.loc[0,'Name']

'Braund, Mr. Owen Harris'

iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:

In [29]:
titanic.loc[:, ['Name','Cabin']]

,Name,Cabin
0,"Braund, Mr. Owen Harris",
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",C85
2,"Heikkinen, Miss. Laina",
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",C123
4,"Allen, Mr. William Henry",
...,...,...
886,"Montvila, Rev. Juozas",
887,"Graham, Miss. Margaret Edith",B42
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",C148
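
One practical difference between the two accessors is how they treat ranges: iloc follows the Python convention of excluding the stop position, while loc includes both end labels. The sketch below shows this on a three-row frame whose values are copied from the first Titanic rows. The shortened surnames used as index labels are hypothetical.

```python
import pandas as pd

# Survived and Pclass for the first three passengers; the shortened
# surnames are hypothetical labels used only for this example.
df = pd.DataFrame(
    {'Survived': [0, 1, 1], 'Pclass': [3, 1, 3]},
    index=['Braund', 'Cumings', 'Heikkinen'],
)

by_label = df.loc['Cumings', 'Pclass']     # select by row and column label
by_position = df.iloc[1, 1]                # the same cell, selected by position

label_slice = df.loc['Braund':'Cumings']   # both endpoints are included
position_slice = df.iloc[0:1]              # the stop position is excluded

print(len(label_slice), len(position_slice))
```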


## Manipulating the index

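The set_index() method replaces the current index with one of the DataFrame's columns. This is useful when a column makes a better index than the default one. By default it returns a new DataFrame and leaves the original unchanged: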

In [30]:
titanic.set_index('Name')

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
"Heikkinen, Miss. Laina",3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
"Futrelle, Mrs. Jacques Heath (Lily May Peel)",4,1,1,female,35.0,1,0,113803,53.1000,C123,S
"Allen, Mr. William Henry",5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
"Montvila, Rev. Juozas",887,0,2,male,27.0,0,0,211536,13.0000,,S
"Graham, Miss. Margaret Edith",888,1,1,female,19.0,0,0,112053,30.0000,B42,S
"Johnston, Miss. Catherine Helen ""Carrie""",889,0,3,female,,1,2,W./C. 6607,23.4500,,S
"Behr, Mr. Karl Howell",890,1,1,male,26.0,0,0,111369,30.0000,C148,C

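As a minimal sketch of what set_index changes, the example below applies it to a two-row frame built from values in the table above and then uses the new labels with loc. The names `df` and `by_name` are illustrative.

```python
import pandas as pd

# Two passengers taken from the Titanic output above.
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'Age': [22.0, 26.0],
})

# set_index returns a new DataFrame; `df` keeps its default 0, 1 index.
by_name = df.set_index('Name')

# The former column now provides the row labels, so loc can use them.
age = by_name.loc['Heikkinen, Miss. Laina', 'Age']

print(by_name.index.name, age)
```

To keep the new index, assign the result to a variable, as done here with `by_name`.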

## Conditional selection