# Module 1.3 Scientific Modules II

## Table of contents

1. [Table of contents](#Table-of-contents)
2. [Scientific Modules II](#Scientific-Modules-II)
    1. [Pandas](#Pandas)
       1. [Data Structures](#Data-Structures)
          1. [Series](#Series)
          2. [DataFrames](#DataFrames)
       2. [DataFrame methods](#DataFrame-methods)
       3. [Accessing Series and DataFrames](#Accessing-Series-and-DataFrames)
          1. [Accessing Elements in a Series](#Accessing-Elements-in-a-Series)
          2. [Accessing Elements in a DataFrame](#Accessing-Elements-in-a-DataFrame)
          3. [Accessing (parts of) rows and columns](#Accessing-(parts-of)-rows-and-columns)
          4. [Adding rows or columns](#Adding-rows-or-columns)
       4. [Pandas and files](#Pandas-and-files)
3. [Exercises](#Exercises)
    1. [Exercise 54 - Pandas DataFrames](#Exercise-54---Pandas-DataFrames)
    2. [Exercise 55 - DataFrame methods](#Exercise-55---DataFrame-methods)
    3. [Exercise 56 - Single elements in DataFrames](#Exercise-56---Single-elements-in-DataFrames)
    4. [Exercise 57 - Modifying DataFrames](#Exercise-57---Modifying-DataFrames)
    5. [Exercise 58 - Pandas and Files](#Exercise-58---Pandas-and-Files)

# Scientific Modules II

## Pandas

Pandas is used to work with tables, which it calls DataFrames.

- supports file formats such as .csv, .xlsx, .sql, .json and .parquet
- operations on a column work elementwise
- DataFrames are straightforward to extend and edit

Standard import:

    import pandas as pd
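
As a small sketch of what "elementwise" means for columns: the column names and values below are made up for illustration. Multiplying two columns multiplies them row by row, and the result can be stored in a new column.

```python
import pandas as pd

# Hypothetical measurements, one row per sample
df = pd.DataFrame(
    {"length_cm": [1.0, 2.0, 3.0], "width_cm": [4.0, 5.0, 6.0]},
    index=["s1", "s2", "s3"],
)

# The multiplication is applied row by row, no loop is needed
df.loc[:, "area_cm2"] = df.loc[:, "length_cm"] * df.loc[:, "width_cm"]

print(df)
```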

In [1]:
#installed pandas via the terminal: `pip3 install pandas`
#at the first import I got a warning that `Pyarrow` will soon be a dependency of pandas, so just to be sure I installed that as well

In [2]:
import pandas as pd
#I also import NumPy since I need it in the examples
import numpy as np

### Data Structures


- Series (one dimensional):
  can be seen as a one-column DataFrame with row names
- DataFrame (two dimensional):
  has row names and column names


> In pandas **index** means **row name**

Notes on naming the rows and columns: 

- duplicate names in row and column names are possible but can lead to difficulties later (e.g. when trying to sort according to a column name that isn't unique)
- the names are type sensitive! A column name 8 (type: int) will not be found when looking for "8" (type: str)   
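
A minimal sketch of the type sensitivity, using a made-up one-row DataFrame: the integer `8` and the string `"8"` are two different names.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], index=["row"], columns=[8, "name"])

has_int_8 = 8 in df.columns      # the column is named with the integer 8
has_str_8 = "8" in df.columns    # the string "8" is a different name

print(has_int_8, has_str_8)
print(df.loc["row", 8])
```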

#### Series



One dimensional; pandas treats it as a single column with row names, so the pandas functions work on it. Similar to lists and tuples.

Creating an empty series: 

    series1 = pd.Series()

Data can be added in the form of an iterable (list, tuple etc.). The first argument *always* has to be the data. A Series works best when all values share one data type, but mixed data types are accepted.

Row names are set with the argument `index = []`. Data and index must have the same number of elements! If no index is given, the rows are numbered starting at 0. For ease of handling, try to always name the rows using strings!
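
The rules above can be sketched as follows (the values are arbitrary). The third call is expected to fail because two row names are given for three values.

```python
import pandas as pd

# With row names: one name per value
named = pd.Series([1, 3, None], index=["A", "B", "C"])

# Without row names: the rows are numbered 0, 1, 2
unnamed = pd.Series([1, 3, None])

# Two names for three values cannot work
try:
    pd.Series([1, 3, None], index=["A", "B"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(named.dtype)
print(list(unnamed.index))
print(mismatch_error)
```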


In [3]:
series1 = pd.Series([1, 3, 5, None, 6, 8],index=["A","B","C","D","E","F"])
print(series1)

#Note that the placeholder `None` is stored as `NaN`, which is a float, so the other numbers are turned into floats as well.

A    1.0
B    3.0
C    5.0
D    NaN
E    6.0
F    8.0
dtype: float64


In [4]:
print(series1[0])
print(series1["A"])

1.0
1.0




#### DataFrames



Two dimensional with row and column names

Creation of an empty DataFrame:

    df1 = pd.DataFrame()

Data can be added at creation, e.g. by passing an ndarray (e.g. `np.random.randn(2,4)`).

Rows are named with `index = []`, columns with `columns = []`.

If `index` and `columns` are given without data, the result is a DataFrame of that shape in which every cell is empty (`NaN`).

To change column or row names, pass a dictionary in which the key is the current name and the value is the new name:

    df = df.rename(columns = {"OldColName1":"NewColName1","OldColName2":"NewColName2"}, index = {"OldRowName1":"NewRowName1"})

In [5]:
dataframe1 = pd.DataFrame(np.random.randn(3,4), index=["Test1","Test2","Test3"], columns=["A","B","C","D"])
print(dataframe1)

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240


In [6]:
dataframe1b = dataframe1.rename(columns = {"A":"Hello1","B":"Hello2"}, index = {"Test1":"New"})
print(dataframe1b)

         Hello1    Hello2         C         D
New    0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240


### DataFrame methods

Generally the DataFrame methods work column-wise. Specifying `axis = 1` changes this to row-wise.

`df.sum()`    
sums up all the values in each column, returns a Series with the column names as row names.

`df.sum(axis = 1)`    
sums up all the values in each row, returns a Series with the row names as row names.

`df.describe()`   
describes the df: count, mean, std, min, 25%, 50%, 75%, max     
note that `NaN` is not included in count, so missing values show up immediately

`df.sort_index()`   
by default works row-wise, not column-wise! Data is sorted according to the row names, with `axis = 1` according to the column names.   
sorts ascending by default, for descending add `ascending = False`    
does not work in-place but creates a new DataFrame    

`df.sort_values(by = "columnname")` or `df.sort_values(["colname"])`      
Sorts the df by the values in the specified column. To sort the columns by the values in a row, pass a row name and add `axis = 1`.   
Sorting according to several columns: `df.sort_values(["colname1","colname2"])`     
does not work in-place but creates a new DataFrame 

`df.index`   
Without `()`! Gives all row names.    
We can even access individual names with `[]`. Behaves a bit like a list, but *isn't* a list. List methods won't work.

`df.columns`   
Without `()`! Gives all column names.    
We can even access individual names with `[]`. Behaves a bit like a list, but *isn't* a list. List methods won't work.


index and column names are *Attributes* of our DataFrame. Attributes are always used without `()`
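
A short sketch of the methods and attributes listed above, on a small hand-made DataFrame (names and values are arbitrary). Note that the original `df` keeps its row order, since the sort methods return new DataFrames.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [3, 1, 2], "y": [10, 30, 20]},
    index=["c", "a", "b"],
)

col_sums = df.sum()            # one value per column
row_sums = df.sum(axis=1)      # one value per row
by_index = df.sort_index()     # new DataFrame sorted by row name
by_x_desc = df.sort_values(by="x", ascending=False)

print(col_sums)
print(row_sums)
print(by_index)
print(by_x_desc)

# Attributes: no parentheses
print(list(df.index), list(df.columns))
```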

### Accessing Series and DataFrames

#### Accessing Elements in a Series

Elements in a Series are accessed using `[]` with the row name:

    series1[Rowname]
    
A single row name returns a single element.

Elements in a Series *can* be accessed using indices `series[idx]` starting at 0, but this is **not recommended**    
-> it stops working as soon as there is a number among the row names, because the number is then looked up as a row name!

To access elements using indices use

    series1.iloc[idx]

This supports slices as well (but not tuples)
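
To illustrate the conflict, here is a sketch with a made-up Series whose row names are integers in reverse order. `s[0]` then means "the row named 0", while `s.iloc[0]` means "the first row".

```python
import pandas as pd

# The row names are integers, but in a different order than the positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_name = s[0]           # looks for the row *named* 0
by_position = s.iloc[0]  # takes the first row
first_two = s.iloc[0:2]  # slice by position

print(by_name, by_position)
print(first_two)
```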

In [7]:
#The index and the row name both work because all row names are strings. 
#If the rownames are not defined the default is equal to the index
print(series1)
print()
print(series1["A"])
print(series1.iloc[0])
print(series1.iloc[0:2])

A    1.0
B    3.0
C    5.0
D    NaN
E    6.0
F    8.0
dtype: float64

1.0
1.0
A    1.0
B    3.0
dtype: float64


#### Accessing Elements in a DataFrame

There are two methods to access elements in a DataFrame:

##### df.loc()



`df.loc` uses the colname and rowname to access an element

    df.loc[Rowname,Colname]
    
e.g.: `df1.loc["Row1","Col2"]` accesses the element in the row called "Row1" and column called "Col2" of df1





In [8]:
#Example:
print(dataframe1)
print()
print(dataframe1.loc["Test2","A"])

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240

-0.5320486676460422


##### df.iloc()

`df.iloc` uses the indices, similar to accessing multidimensional lists; indices start at 0. Two ways of writing it:
    
    df.iloc[row_idx].iloc[col_idx]
    df.iloc[row_idx,col_idx]

`df.iloc[3].iloc[2]` or `df.iloc[3,2]` accesses the element in the fourth row and third column (indices start at 0)   

The syntax `df.iloc[row_idx][col_idx]` works as well, *but only* as long as the column names are either all *unique strings* or *haven't been defined* (i.e. are the same as the index)

Why this is so:    
`df1.iloc[3]` creates a Series from the fourth row, with the colnames of df1 as row names. (see below accessing rows and columns)     
Since `series[x]` looks for row names before indices, the correct use would be `df.iloc[idx][Colname]`   
`df.iloc[idx][idx]` is being deprecated since it leads to errors when the df has duplicate column names or numbers among its column names.

**For** `iloc` **always use the syntax** `df.iloc[idx_row].iloc[idx_col]` **or** `df.iloc[row_idx,col_idx]`
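
A sketch of how the chained syntax can silently return the wrong element, assuming a made-up DataFrame whose column names are the integers 2, 1, 0:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30]], index=["row"], columns=[2, 1, 0])

row = df.iloc[0]                  # Series whose row names are 2, 1, 0
chained = row[0]                  # the value *named* 0, i.e. the third column
by_position = df.iloc[0, 0]       # first row, first column
same_result = df.iloc[0].iloc[0]  # also first row, first column

print(chained, by_position, same_result)
```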

In [9]:
#Examples:
print(dataframe1)
print()
print(dataframe1.iloc[1].iloc[2])
print(dataframe1.iloc[1,2])

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240

0.4509102918586784
0.4509102918586784


#### Accessing (parts of) rows and columns

> Note: if a single row or column is accessed, the returned object is a Series whose row names are the former colnames (if a row was accessed) or rownames (if a column was accessed) of the df. A DataFrame is returned when both rows and columns are selected with a slice or with several names/indices.

##### using loc()

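Several rows or columns can be accessed with a tuple or a slice (a list such as `["A","B"]` works as well).    
The first and last name in a slice may be left out -> for a whole row or column simply put `:`    
Unlike index slices, a slice of names includes its end: `"A":"C"` returns the columns A, B and C.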

Examples:

- to access a single row, replace the column with `:`    
  `df.loc[Rowname,:]`    
  may be shortened to    
  `df.loc[Rowname]`
- to access a single column, replace the row with `:`    
  `df.loc[:,Colname]`
- to access two entire rows, replace the column with `:` and add the rownames as a tuple    
  `df.loc[(Rowname1, Rowname2),:]`
- to access a defined subset add the rownames and colnames as tuples    
  `df.loc[(Rowname1, Rowname2),(Colname3,Colname4)]`
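
The difference in return type can be checked directly. This is a sketch with a made-up 2x3 DataFrame; lists are used for the multiple names.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["r1", "r2"],
    columns=["A", "B", "C"],
)

one_row = df.loc["r1"]                    # Series, row names A, B, C
one_col = df.loc[:, "B"]                  # Series, row names r1, r2
block = df.loc[["r1", "r2"], ["A", "C"]]  # DataFrame
one_row_as_df = df.loc[["r1"]]            # a list with one name still gives a DataFrame

print(type(one_row).__name__, type(one_col).__name__)
print(type(block).__name__, type(one_row_as_df).__name__)
```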



In [10]:
#Examples for loc
print(dataframe1)
print()
#only the row Test2
print(dataframe1.loc["Test2"])
#only the column "A"
print(dataframe1.loc[:,"A"])
#both column "A" and "B"
print(dataframe1.loc[:,("A","B")])
#columns "A" to "C"
print(dataframe1.loc[:,"A":"C"])
#row "Test2" till end
print(dataframe1.loc["Test2":])

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240

A   -0.532049
B   -0.925582
C    0.450910
D    0.236353
Name: Test2, dtype: float64
Test1    0.745268
Test2   -0.532049
Test3   -0.366323
Name: A, dtype: float64
              A         B
Test1  0.745268  0.889983
Test2 -0.532049 -0.925582
Test3 -0.366323 -0.167643
              A         B         C
Test1  0.745268  0.889983  0.148750
Test2 -0.532049 -0.925582  0.450910
Test3 -0.366323 -0.167643 -1.088904
              A         B         C         D
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240


##### Using iloc

Only use the syntax `df.iloc[row,col]`   
Several rows or columns can be accessed with a list or a slice.    
All rows or columns are accessed with a `:`

Examples:

- slicing    
  `df.iloc[ridx1:ridx2, cidx1:cidx2]`   
- to access a single row:    
  `df.iloc[row_idx,:]`    
  can be abbreviated to `df.iloc[row_idx,]` or `df.iloc[row_idx]`
- to access a single column:    
  `df.iloc[:,col_idx]`    
  cannot be abbreviated!
- integer list with double []    
  `df.iloc[ [ridx1,ridx2] , [cidx1,cidx2] ]` 
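
One difference between the two kinds of slices is easy to miss: a slice of names with `loc` includes its end, a slice of indices with `iloc` does not. A sketch with a made-up 3x3 DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r1", "r2", "r3"],
    columns=["A", "B", "C"],
)

by_name = df.loc["r1":"r2", "A":"B"]   # r2 and B are included
by_position = df.iloc[0:2, 0:2]        # index 2 is excluded
only_first = df.iloc[0:1]              # just the first row

print(by_name)
print(by_position)
print(only_first)
```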


In [11]:
print(dataframe1)
print()
#Accessing the rows
print(dataframe1.iloc[1,:])
print(dataframe1.iloc[1])
print()
#Accessing the columns
print(dataframe1.iloc[:,0])
print()
#Slicing and integer list
print(dataframe1.iloc[:2,1:3])
print(dataframe1.iloc[[0,2],[1,3]])

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240

A   -0.532049
B   -0.925582
C    0.450910
D    0.236353
Name: Test2, dtype: float64
A   -0.532049
B   -0.925582
C    0.450910
D    0.236353
Name: Test2, dtype: float64

Test1    0.745268
Test2   -0.532049
Test3   -0.366323
Name: A, dtype: float64

              B        C
Test1  0.889983  0.14875
Test2 -0.925582  0.45091
              B         D
Test1  0.889983  1.502688
Test3 -0.167643  0.615240


#### Adding rows or columns



Rows or columns can be added using `df.loc[]`. If a value is placed in a row and/or column that doesn't exist, the row/column is created to hold the value. The other fields are left empty (`NaN`). This doesn't work with `iloc`

General Syntax:    
`df.loc["NewRow","NewColumn"] = value`

for an empty row/column assign the value `None` and replace the column/row with `:`. For a new empty column:    
`df.loc[:,"NewColumn"] = None`
   
for a new column with several entries, exactly as many as there are rows:    
`df.loc[:,"NewColumn"] = [1,2,3,4]`

otherwise the rows (existing or new) need to be named    
`df.loc[("A","B"),"NewColumn"] = [1,2]`
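
A sketch of the difference between `loc` and `iloc` when the target does not exist yet, using a made-up two-row DataFrame. The `iloc` assignment is expected to raise an `IndexError`, which is caught here so that the example runs through.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]}, index=["r1", "r2"])

df.loc["r3", "A"] = 3      # new row
df.loc["r1", "B"] = 5      # new column, the other rows stay empty

# iloc only reaches positions that already exist
try:
    df.iloc[3, 0] = 4
    iloc_error = None
except IndexError as err:
    iloc_error = err

print(df)
print(iloc_error)
```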


> **A NOTE ON MISSING VALUES**   
> Python `None`, the float `NaN` (`np.nan`) and pandas `pd.NA` are not (quite) the same!!   
> the function to find empty values can depend on the kind of missing value; `pd.isna()` and `pd.notnull()` recognise all three
>
> `NaN` (`np.nan`) happens e.g. if we can't do a calculation, and when pandas fills the empty cells of a new DataFrame/row/column    
> `None` can stay as `None` when it is assigned explicitly    
> `<NA>` can be created with `pd.NA`

In [12]:
print(dataframe1)
dataframe1.loc["Test1","E"] = "new"
print(dataframe1)
dataframe1.loc["Test5","F"] = None
print(dataframe1)
dataframe1.loc["Test5","G"] = pd.NA
print(dataframe1)

# Note that the row Test5 now has 3 kinds of missing values: NaN, None and <NA>!!

              A         B         C         D
Test1  0.745268  0.889983  0.148750  1.502688
Test2 -0.532049 -0.925582  0.450910  0.236353
Test3 -0.366323 -0.167643 -1.088904  0.615240
              A         B         C         D    E
Test1  0.745268  0.889983  0.148750  1.502688  new
Test2 -0.532049 -0.925582  0.450910  0.236353  NaN
Test3 -0.366323 -0.167643 -1.088904  0.615240  NaN
              A         B         C         D    E     F
Test1  0.745268  0.889983  0.148750  1.502688  new   NaN
Test2 -0.532049 -0.925582  0.450910  0.236353  NaN   NaN
Test3 -0.366323 -0.167643 -1.088904  0.615240  NaN   NaN
Test5       NaN       NaN       NaN       NaN  NaN  None
              A         B         C         D    E     F     G
Test1  0.745268  0.889983  0.148750  1.502688  new   NaN   NaN
Test2 -0.532049 -0.925582  0.450910  0.236353  NaN   NaN   NaN
Test3 -0.366323 -0.167643 -1.088904  0.615240  NaN   NaN   NaN
Test5       NaN       NaN       NaN       NaN  NaN  None  <NA>
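
The three kinds of missing value seen in row Test5 can be told apart from real values with `pd.isna()`. A minimal sketch; it also shows why a comparison with `==` is not a reliable test for `NaN`.

```python
import numpy as np
import pandas as pd

missing = [None, np.nan, pd.NA]
flags = [pd.isna(value) for value in missing]

# NaN is not equal to anything, not even to itself
nan_equals_itself = np.nan == np.nan

# In a numeric Series both None and np.nan end up as NaN
s = pd.Series([1.0, None, np.nan])
n_missing = s.isna().sum()
n_present = s.count()      # count() skips missing values

print(flags)
print(nan_equals_itself)
print(n_missing, n_present)
```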


### Pandas and files

methods to save DataFrames:

- df.to_xml
- df.to_csv
- df.to_excel
- etc.
  
functions to read in files:

- pd.read_xml
- pd.read_csv
- pd.read_excel
- pd.read_json

To read in the first sheet of an Excel file (if it is in the same directory as my script)    
`data1 = pd.read_excel("Cation.xlsx")`

- by default the column names are taken from the file; row names are only read if specified with `index_col`     
  for the first column: `index_col = 0`
- the sheet can be chosen with the argument `sheet_name`    
  if this is set to `None` we get the entire file as a dictionary with one DataFrame per sheet and the sheet names as keys
- if my file is in a different directory, use the absolute path or use the module `os` to change the working directory

to use the Excel functions the module `openpyxl` has to be installed


In [13]:
file1 = pd.read_excel("/Users/Lisa/Documents/Lisa/Arbeit/FortbildungBioinformatik/ABI-2024-1_Course-Materials/Teaching_materials/Module 1 - Introduction to programming and python/1.3 Introduction to programming and Python/ABI_Files-Python/Cation.xlsx")   
print("#### default: the first spreadsheet ####")
print(file1)

#### default: the first spreadsheet ####
   Unnamed: 0       ModBase   SwissModel        T0863      T0872        T0886  \
0        #Yes   2812.000000  1134.000000   486.000000  482.00000   486.000000   
1         #No    384.000000   111.000000     5.000000   21.00000    10.000000   
2       #Mean     17.837838    26.679012    51.777778    4.93361     8.691358   
3         HIS   5605.000000  3552.000000  2865.000000  302.00000   372.000000   
4         PHE   6551.000000  3448.000000  4740.000000  574.00000  1273.000000   
5         TRP   3362.000000  1922.000000  1282.000000   71.00000     0.000000   
6         TYR   5351.000000  3178.000000  3627.000000  588.00000  1079.000000   
7         ARG  11664.000000  7254.000000  4243.000000  302.00000   461.000000   
8         ASN   3520.000000  1705.000000  3202.000000  528.00000  1034.000000   
9         GLN   3825.000000  2142.000000  2527.000000  590.00000   743.000000   
10        LYS   2264.000000  1266.000000  2540.000000  129.00000   7

In [14]:
print("#### whole notebook with sheet_name = None ####")
file2 = pd.read_excel("/Users/Lisa/Documents/Lisa/Arbeit/FortbildungBioinformatik/ABI-2024-1_Course-Materials/Teaching_materials/Module 1 - Introduction to programming and python/1.3 Introduction to programming and Python/ABI_Files-Python/Cation.xlsx", sheet_name=None)   
print(file2)

#### whole notebook with sheet_name = None ####
{'Hits':    Unnamed: 0       ModBase   SwissModel        T0863      T0872        T0886  \
0        #Yes   2812.000000  1134.000000   486.000000  482.00000   486.000000   
1         #No    384.000000   111.000000     5.000000   21.00000    10.000000   
2       #Mean     17.837838    26.679012    51.777778    4.93361     8.691358   
3         HIS   5605.000000  3552.000000  2865.000000  302.00000   372.000000   
4         PHE   6551.000000  3448.000000  4740.000000  574.00000  1273.000000   
5         TRP   3362.000000  1922.000000  1282.000000   71.00000     0.000000   
6         TYR   5351.000000  3178.000000  3627.000000  588.00000  1079.000000   
7         ARG  11664.000000  7254.000000  4243.000000  302.00000   461.000000   
8         ASN   3520.000000  1705.000000  3202.000000  528.00000  1034.000000   
9         GLN   3825.000000  2142.000000  2527.000000  590.00000   743.000000   
10        LYS   2264.000000  1266.000000  2540.00000

# Exercises

## Exercise 54 - Pandas DataFrames


Write a program that creates multiple DataFrames. The DataFrames should have the following dimensions:

1. 6 rows and 4 columns
2. 4 rows and 8 columns
3. 10 rows and 3 columns
4. 5 rows and 9 columns

> custom rownames and colnames, passed directly as lists

In [15]:
df1 = pd.DataFrame(np.random.randint(-10, 10, size=(6,4)),index=["I","add","6","row","names","manually"], columns = ["A","B","C","D"])
print(df1)

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5


> changed rownames and colnames by defining them as lists beforehand. Duplicates are allowed, even in both! This might lead to problems down the line though!

In [16]:
rownames1 = ["My","row","row","names"]
colnames1 = ["I","need",8,"colnames","I","can","choose","duplicates!"]
df2 = pd.DataFrame(np.random.randint(0,100, size=(4,8)),index= rownames1, columns = colnames1)
print(df2)

        I  need   8  colnames   I  can  choose  duplicates!
My     28    98  48        22  77    1       5           84
row    63    20  15        11  54   21      13           15
row    70    18  59        12  17   74      20           65
names  11    69  11         3  93   32      75            6


> changed column names

In [17]:
df3 = pd.DataFrame(np.random.randint(5, 50, size=(10,3)), columns = ["My","column","names"])
print(df3)

   My  column  names
0  17      25     27
1  31      24     47
2  23      48     12
3  45      33     40
4  29      32     13
5  23      33     17
6  13      37     33
7  21      43     29
8  27      49     14
9   6      11     10


> I build my row and column names with list comprehensions

In [18]:
rownames2 = ["Row"+str(i) for i in range(1,6)]
colnames2 = ["Col"+str(i) for i in range(1,10)]
df4 = pd.DataFrame(index = rownames2, columns = colnames2)
print(df4)

     Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
Row1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN


> In a large df the middle is cut off with `...`; if the output continues in a second block of lines this is shown with `\`


In [19]:
#a very large df
df5 = pd.DataFrame(np.random.randint(0,100, size=(4,50)))
print(df5)

   0   1   2   3   4   5   6   7   8   9   ...  40  41  42  43  44  45  46  \
0  20  43  86  83  37  38  28  65  93   4  ...   1  91  18  26  85  93  62   
1   8  67  36  41  41  57   0  73  36  49  ...  73  39   9  39  32  70  85   
2  69  95  78  66  34  55  35  92  40  19  ...  95  36  64  16  53  12  38   
3  46   0  61  57  25  94  27  85   2  73  ...  80  41  42   5  49  26  35   

   47  48  49  
0  71   4  83  
1  70  73  23  
2  12  41   3  
3  22  29  89  

[4 rows x 50 columns]


## Exercise 55 - DataFrame methods


Use your DataFrames from Exercise 54 to fulfill the following tasks:

1. Calculate the columnwise sum for the first DataFrame
2. Calculate the rowwise sum for the second DataFrame
3. Sort the fourth DataFrame according to the values in the first and third column
4. Create a description of the third DataFrame
5. Print out the row and column names of the second DataFrame

In [20]:
#Calculate the columnwise sum for the first DataFrame
print(df1)
print()
print(df1.sum())

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5

A    -4
B    -3
C    13
D     9
dtype: int64


In [21]:
#Calculate the rowwise sum for the second DataFrame
print(df2)
print()
print(df2.sum(axis=1))

        I  need   8  colnames   I  can  choose  duplicates!
My     28    98  48        22  77    1       5           84
row    63    20  15        11  54   21      13           15
row    70    18  59        12  17   74      20           65
names  11    69  11         3  93   32      75            6

My       363
row      212
row      335
names    300
dtype: int64


In [22]:
#Sort the fourth DataFrame according to the values in the first and third column
#df4 is an empty dataframe... I use df2 instead, and since the first column name is a duplicate I use the second
#note that the column name 8 is an integer and is not recognized as the string "8"
print(df2)
print()
print(df2.sort_values(["need",8]))

        I  need   8  colnames   I  can  choose  duplicates!
My     28    98  48        22  77    1       5           84
row    63    20  15        11  54   21      13           15
row    70    18  59        12  17   74      20           65
names  11    69  11         3  93   32      75            6

        I  need   8  colnames   I  can  choose  duplicates!
row    70    18  59        12  17   74      20           65
row    63    20  15        11  54   21      13           15
names  11    69  11         3  93   32      75            6
My     28    98  48        22  77    1       5           84
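
For the task as stated (sorting by the first and third column), the names can be picked from `df.columns` by position. This is a sketch on a made-up DataFrame with unique column names and ties in the first key, so that the second key actually matters.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [2, 1, 2, 1], "skip": [0, 0, 0, 0], "b": [1, 4, 0, 3]},
    index=["w", "x", "y", "z"],
)

# Names of the first and third column
keys = list(df.columns[[0, 2]])

# Sorted by "a" first; rows with the same "a" are ordered by "b"
sorted_df = df.sort_values(keys)

print(keys)
print(sorted_df)
```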


In [23]:
#Create a description of the third DataFrame
print(df3)
print()
print(df3.describe())

   My  column  names
0  17      25     27
1  31      24     47
2  23      48     12
3  45      33     40
4  29      32     13
5  23      33     17
6  13      37     33
7  21      43     29
8  27      49     14
9   6      11     10

             My     column     names
count  10.00000  10.000000  10.00000
mean   23.50000  33.500000  24.20000
std    10.67968  11.664285  12.95119
min     6.00000  11.000000  10.00000
25%    18.00000  26.750000  13.25000
50%    23.00000  33.000000  22.00000
75%    28.50000  41.500000  32.00000
max    45.00000  49.000000  47.00000


In [24]:
#Print out the row and column names of the second DataFrame
print(df2.index)
print(df2.columns)

Index(['My', 'row', 'row', 'names'], dtype='object')
Index(['I', 'need', 8, 'colnames', 'I', 'can', 'choose', 'duplicates!'], dtype='object')


## Exercise 56 - Single elements in DataFrames


Use your DataFrames from Exercise 54 to fulfill the following tasks:

1. Print out the element in the first column and first row of DataFrame1
2. Print out the element in the third column and the second row of DataFrame2
3. Print out the fifth row of DataFrame3
4. Print out the sixth column of DataFrame4
   
Use both versions (loc and iloc to fulfill the tasks)

In [25]:
#Print out the element in the first column and first row of DataFrame1
print(df1)
print(df1.iloc[0,0])
print(df1.iloc[0].iloc[0])
print(df1.loc["I","A"])

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5
-8
-8
-8


In [26]:
#Do not use this syntax!
print(df1)
print(df1.iloc[0][0])

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5
-8




why `df1.iloc[0][0]` gives a warning:

`df1.iloc[idx]` creates a Series out of the row with index = idx of df1. The *colnames* of df1 are turned into the *rownames* of this Series

the syntax `series[x]` expects x to be a row name.      
If x is an *index*, this works only if there are *absolutely no numbers (or duplicates?)* among the rownames of the Series (which were the colnames of df1)

In [27]:
print(df1)
print(df1.iloc[0]["A"])

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5
-8


In [28]:
#Print out the element in the third column and the second row of DataFrame2
#the second row has a non-unique name: use first row for loc
#because of the duplicates I strictly need the iloc.iloc notation here!!
print(df2)
print(df2.loc["My",8])
print(df2.iloc[0].iloc[2])
print(df2.iloc[1].iloc[2])
#print(df2.iloc[0][2])

        I  need   8  colnames   I  can  choose  duplicates!
My     28    98  48        22  77    1       5           84
row    63    20  15        11  54   21      13           15
row    70    18  59        12  17   74      20           65
names  11    69  11         3  93   32      75            6
48
48
15


In [29]:
#Print out the fifth row of DataFrame4
#adding list shows the print out in a single line
print(df4)
print()

#for `loc` accessing rows can be done in two ways:
print(df4.loc["Row5",:])
print(list(df4.loc["Row5"]))

#for iloc we have a few more
print(list(df4.iloc[4,:]))
print(list(df4.iloc[4]))
print(list(df4.iloc[4].iloc[:]))
print(list(df4.iloc[4][:]))
print(list(df4.iloc[:].iloc[4]))


     Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
Row1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Col1    NaN
Col2    NaN
Col3    NaN
Col4    NaN
Col5    NaN
Col6    NaN
Col7    NaN
Col8    NaN
Col9    NaN
Name: Row5, dtype: object
[nan, nan, nan, nan, nan, nan, nan, nan, nan]
[nan, nan, nan, nan, nan, nan, nan, nan, nan]
[nan, nan, nan, nan, nan, nan, nan, nan, nan]
[nan, nan, nan, nan, nan, nan, nan, nan, nan]
[nan, nan, nan, nan, nan, nan, nan, nan, nan]
[nan, nan, nan, nan, nan, nan, nan, nan, nan]


In [30]:
#Print out the sixth column of DataFrame4
print(df4)
print()
#for loc and iloc a column needs the `:` in place of the rows:
print(df4.loc[:,"Col6"])
print()
print(list(df4.iloc[:,5]))

     Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
Row1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Row1    NaN
Row2    NaN
Row3    NaN
Row4    NaN
Row5    NaN
Name: Col6, dtype: object

[nan, nan, nan, nan, nan]


## Exercise 57 - Modifying DataFrames


Use your DataFrames from Exercise 54 to fulfill the following tasks:

1. Place a new element in the fifth row of a new column into DataFrame1
2. Place a new element in the fourth column of a new row into DataFrame2
3. Place a new element in the seventh row of a new column into DataFrame3
4. Place a new element into a new row and new column into DataFrame4

In [31]:
print(df1)
print()
df1.loc["names",4] = 24
print(df1)

          A  B  C  D
I        -8 -1 -2  7
add      -5  8  2 -9
6        -8 -8  0 -7
row       8  2  6  5
names     3  0  9  8
manually  6 -4 -2  5

          A  B  C  D     4
I        -8 -1 -2  7   NaN
add      -5  8  2 -9   NaN
6        -8 -8  0 -7   NaN
row       8  2  6  5   NaN
names     3  0  9  8  24.0
manually  6 -4 -2  5   NaN


In [32]:
print(df2)
print()
df2.loc["new","colnames"] = 50
print(df2)

        I  need   8  colnames   I  can  choose  duplicates!
My     28    98  48        22  77    1       5           84
row    63    20  15        11  54   21      13           15
row    70    18  59        12  17   74      20           65
names  11    69  11         3  93   32      75            6

          I  need     8  colnames     I   can  choose  duplicates!
My     28.0  98.0  48.0      22.0  77.0   1.0     5.0         84.0
row    63.0  20.0  15.0      11.0  54.0  21.0    13.0         15.0
row    70.0  18.0  59.0      12.0  17.0  74.0    20.0         65.0
names  11.0  69.0  11.0       3.0  93.0  32.0    75.0          6.0
new     NaN   NaN   NaN      50.0   NaN   NaN     NaN          NaN


In [33]:
print(df3)
print()
df3.loc[6,"new"] = 100
print(df3)

   My  column  names
0  17      25     27
1  31      24     47
2  23      48     12
3  45      33     40
4  29      32     13
5  23      33     17
6  13      37     33
7  21      43     29
8  27      49     14
9   6      11     10

   My  column  names    new
0  17      25     27    NaN
1  31      24     47    NaN
2  23      48     12    NaN
3  45      33     40    NaN
4  29      32     13    NaN
5  23      33     17    NaN
6  13      37     33  100.0
7  21      43     29    NaN
8  27      49     14    NaN
9   6      11     10    NaN


In [34]:
print(df4)
print()
df4.loc["Row6","Col9"] = "Wow a value"
print(df4)

     Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
Row1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
Row5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

     Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8         Col9
Row1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN          NaN
Row2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN          NaN
Row3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN          NaN
Row4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN          NaN
Row5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN          NaN
Row6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  Wow a value


## Exercise 58 - Pandas and Files

Use your DataFrame from Exercise 54 and save it to a csv-file and an Excel-file. Try to read these files with the help of pandas afterwards.

In [35]:
print(df1)

          A  B  C  D     4
I        -8 -1 -2  7   NaN
add      -5  8  2 -9   NaN
6        -8 -8  0 -7   NaN
row       8  2  6  5   NaN
names     3  0  9  8  24.0
manually  6 -4 -2  5   NaN


In [36]:
df1.to_csv("Test_df1.csv")
df1.to_excel("Test_df1.xlsx")

In [37]:
df1_new1 = pd.read_csv("Test_df1.csv")
df1_new2 = pd.read_excel("Test_df1.xlsx")

print(df1_new1)
print()
print(df1_new2)

  Unnamed: 0  A  B  C  D     4
0          I -8 -1 -2  7   NaN
1        add -5  8  2 -9   NaN
2          6 -8 -8  0 -7   NaN
3        row  8  2  6  5   NaN
4      names  3  0  9  8  24.0
5   manually  6 -4 -2  5   NaN

  Unnamed: 0  A  B  C  D     4
0          I -8 -1 -2  7   NaN
1        add -5  8  2 -9   NaN
2          6 -8 -8  0 -7   NaN
3        row  8  2  6  5   NaN
4      names  3  0  9  8  24.0
5   manually  6 -4 -2  5   NaN
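
In the output above the former row names end up in a column called `Unnamed: 0`. A sketch of the round trip with `index_col = 0`, using a made-up DataFrame and a temporary file (the file name is hypothetical). It also shows that an integer column name comes back from a csv-file as a string.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["r1", "r2"],
    columns=["A", 4],   # one column name is the integer 4
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")
    df.to_csv(path)
    plain = pd.read_csv(path)                   # row names land in "Unnamed: 0"
    restored = pd.read_csv(path, index_col=0)   # first column becomes the row names

print(plain)
print(restored)
print(list(restored.columns))
```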
