Data Carpentry Workshop  
August 15-16, 2019  
Smithsonian Castle Library

# Python: Data Types and Formats

This notebook contains the material presented during the Day 2 Python portion of the workshop. The official course materials are available here: https://datacarpentry.org/python-ecology-lesson/04-data-types-and-format/index.html

In [1]:
import pandas as pd

In [2]:
surveys_df = pd.read_csv('https://ndownloader.figshare.com/files/2292172')

In [3]:
type(surveys_df)

pandas.core.frame.DataFrame

`dtype('O')` is the object dtype, which pandas uses here for columns of text (like a `str` in normal Python).

In [4]:
surveys_df['sex'].dtype

dtype('O')

In [5]:
surveys_df['record_id'].dtype

dtype('int64')

In [6]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object
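As a small illustration of how pandas picks a dtype per column, here is a sketch on a toy DataFrame. The values are made up for the example and are not taken from the survey file; the `kind` codes are NumPy's one-letter dtype categories.

```python
import pandas as pd

# Toy data (hypothetical values, not from the survey file)
toy = pd.DataFrame({
    "record_id": [1, 2, 3],          # whole numbers
    "sex": ["M", "F", None],         # text with a missing value
    "weight": [40.0, None, 29.0],    # decimals with a missing value
})

print(toy.dtypes)

# One-letter dtype category per column:
# 'i' = integer, 'f' = float, 'O' = object (text in this notebook)
kinds = {col: toy[col].dtype.kind for col in toy.columns}
print(kinds)
```

Note that a missing value in a numeric column forces that column to float, because NaN is itself a float.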

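Use `int()` to convert a float to an integer, and `float()` to convert an integer to a float. Note that `int()` drops the decimal part rather than rounding. (These built-in functions work on single values; they don't work on DataFrame columns.)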

In [7]:
a = 7.83
b = 7

In [8]:
int(a)

7

In [9]:
float(b)

7.0
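To make the two points above concrete, the following sketch shows that `int()` truncates toward zero and that calling it on a whole column fails. The two-element Series `col` is a made-up stand-in for a DataFrame column.

```python
import pandas as pd

a = 7.83
print(int(a))     # decimal part is dropped, not rounded
print(int(-a))    # truncation is toward zero
print(round(a))   # use round() if you want the nearest integer

# Hypothetical stand-in for a DataFrame column
col = pd.Series([7.83, 2.10])

try:
    int(col)
    raised = False
except TypeError as err:
    # int() only accepts a single value, not a whole column
    raised = True
    print("TypeError:", err)

# The column-wise equivalent is astype()
converted = col.astype("int64")
print(converted.tolist())
```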

Convert the data type of a DataFrame column with `astype()`. It returns a converted copy, so assign the result back to the column to keep the change.

In [10]:
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')

In [11]:
surveys_df['plot_id'] = surveys_df['plot_id'].astype('float64')

In [12]:
surveys_df.dtypes

record_id          float64
month                int64
day                  int64
year                 int64
plot_id            float64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object
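The two conversions above could also be written as a single call. This is only a sketch on a toy DataFrame with made-up values; it relies on `DataFrame.astype()` accepting a dictionary that maps column names to dtypes.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"record_id": [1, 2, 3], "plot_id": [2, 3, 2]})

# Convert several columns at once with a {column: dtype} mapping
converted = toy.astype({"record_id": "float64", "plot_id": "float64"})

print(converted.dtypes)
print(toy.dtypes)  # the original is untouched until you reassign it
```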

The conversion below fails with `ValueError: Cannot convert non-finite values (NA or inf) to integer`. The weight column contains null (NaN) values, and NaN cannot be represented as an integer.

In [13]:
surveys_df['weight'] = surveys_df['weight'].astype('int64')

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-931e9a157131>", line 1, in <module>
    surveys_df['weight'] = surveys_df['weight'].astype('int64')
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5001, in astype
    **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3714, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3581, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py", line 575, in astype
    **kwargs)
  File "C:\ProgramData\Anacon

ValueError: Cannot convert non-finite values (NA or inf) to integer
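The same failure can be reproduced on a tiny Series, which also shows that the conversion succeeds once the missing values are handled first. The values below are made up; whether dropping or filling is appropriate depends on your analysis.

```python
import numpy as np
import pandas as pd

# Toy weights with one missing value (hypothetical values)
weight = pd.Series([40.0, np.nan, 29.0])

try:
    weight.astype("int64")
    failed = False
except ValueError as err:
    # NaN has no integer representation
    failed = True
    print("ValueError:", err)

# Handling the NaN first makes the conversion possible
dropped = weight.dropna().astype("int64")   # remove missing rows
filled = weight.fillna(0).astype("int64")   # or replace them with 0

print(dropped.tolist())
print(filled.tolist())
```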

In [14]:
surveys_df.head()

   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight
0        1.0      7   16  1977      2.0          NL    M             32.0     NaN
1        2.0      7   16  1977      3.0          NL    M             33.0     NaN
2        3.0      7   16  1977      2.0          DM    F             37.0     NaN
3        4.0      7   16  1977      7.0          DM    M             36.0     NaN
4        5.0      7   16  1977      3.0          DM    M             35.0     NaN


In [15]:
surveys_df['weight'].mean()

42.672428212991356

`null_weight` is a **boolean mask**: a Series holding `True` or `False` for each row, depending on whether weight is null. You can pass this boolean mask to a DataFrame to filter it; only the rows where the mask is `True` are returned. `len()` provides a count of the rows in the filtered DataFrame.

In [16]:
null_weight = pd.isnull(surveys_df.weight)

In [17]:
len(surveys_df[null_weight])

3266
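Here is a minimal sketch of the same masking pattern on a four-row toy DataFrame (made-up values), so the `True`/`False` values are easy to inspect. The `~` operator inverts the mask to select the rows that do have a weight.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "species_id": ["NL", "DM", "DM", "PF"],
    "weight": [np.nan, 40.0, np.nan, 8.0],
})

# One True/False value per row
null_weight = pd.isnull(toy["weight"])
print(null_weight.tolist())

missing = toy[null_weight]    # rows where weight is null
present = toy[~null_weight]   # ~ inverts the mask

print(len(missing), len(present))
```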

`.copy()` creates a new, independent copy of a DataFrame. If you don't use `.copy()`, you're just creating a second name for the existing DataFrame, and edits made through that name could accidentally change your original.

In [18]:
df1 = surveys_df.copy()
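The difference between a second reference and a real copy can be seen in a short sketch. The names `original`, `alias`, and `independent` are made up for this example.

```python
import pandas as pd

original = pd.DataFrame({"weight": [40.0, 48.0]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias.loc[0, "weight"] = 0.0         # also changes `original`
independent.loc[1, "weight"] = 99.0  # leaves `original` alone

print(alias is original)        # True: one object, two names
print(independent is original)  # False
print(original)
```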

`fillna()` replaces every NaN (null) with a value you specify. It returns a new object, so assign the result back to the column if you want to keep the change.

In [19]:
df1.head()

   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight
0        1.0      7   16  1977      2.0          NL    M             32.0     NaN
1        2.0      7   16  1977      3.0          NL    M             33.0     NaN
2        3.0      7   16  1977      2.0          DM    F             37.0     NaN
3        4.0      7   16  1977      7.0          DM    M             36.0     NaN
4        5.0      7   16  1977      3.0          DM    M             35.0     NaN


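Replacing NaNs with 0.0 affects the mean of the weight column, because `mean()` skips NaN values but counts zeros. In the output shown here, `df1` still contains NaN (no `fillna()` result has been assigned back to it), so the mean below is the same as for `surveys_df`.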

In [20]:
df1.weight.mean()

42.672428212991356
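A minimal sketch on toy values (made up, not the survey data) of how filling NaN with 0 shifts the mean, and of why the result of `fillna()` must be kept to have any effect:

```python
import numpy as np
import pandas as pd

# Toy weights with one missing value (hypothetical values)
weight = pd.Series([40.0, np.nan, 20.0])

mean_skipping_nan = weight.mean()   # NaN is ignored: (40 + 20) / 2

filled = weight.fillna(0)           # returns a new Series
mean_with_zeros = filled.mean()     # zero is counted: (40 + 0 + 20) / 3

print(mean_skipping_nan, mean_with_zeros)
print(weight.isnull().sum())        # the original still has its NaN
```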

In the weight column, set any null values equal to the mean weight for all observations in `surveys_df`. This way, the filled-in values will not change the mean. This is **not** always the correct solution for dealing with NaN values; it is just one option.

In [21]:
df1 = surveys_df.copy()

In [22]:
df1['weight'] = df1['weight'].fillna(surveys_df['weight'].mean())

In [23]:
df1.head()

   record_id  month  day  year  plot_id  species_id  sex  hindfoot_length     weight
0        1.0      7   16  1977      2.0          NL    M             32.0  42.672428
1        2.0      7   16  1977      3.0          NL    M             33.0  42.672428
2        3.0      7   16  1977      2.0          DM    F             37.0  42.672428
3        4.0      7   16  1977      7.0          DM    M             36.0  42.672428
4        5.0      7   16  1977      3.0          DM    M             35.0  42.672428
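One reason mean-filling is not always appropriate can be sketched on toy values (made up for this example): the mean is preserved, but the spread of the data shrinks, because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

# Toy weights with one missing value (hypothetical values)
weight = pd.Series([40.0, np.nan, 20.0])

filled = weight.fillna(weight.mean())

print(weight.mean(), filled.mean())  # the mean is unchanged
print(weight.std(), filled.std())    # the standard deviation gets smaller
```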


*Challenge question*: Count the number of missing values per column. Hint: The method `.count()` gives you the number of non-NA observations per column. Try looking at the `.isnull()` method.

In [24]:
surveys_df.isnull().sum()

record_id             0
month                 0
day                   0
year                  0
plot_id               0
species_id          763
sex                2511
hindfoot_length    4111
weight             3266
dtype: int64
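Both routes mentioned in the hint lead to the same answer. The sketch below, on a toy DataFrame with made-up values, compares summing the `isnull()` mask with subtracting `count()` from the number of rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["M", None, "F", None],
    "weight": [np.nan, 40.0, 29.0, 46.0],
})

# True counts as 1 when summed, so this counts nulls per column
via_isnull = toy.isnull().sum()

# count() gives non-null values, so subtract from the row total
via_count = len(toy) - toy.count()

print(via_isnull)
print(via_count)
```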

`dropna()` removes all rows from a DataFrame that have any null values in them (in any field).

In [25]:
df_na = surveys_df.dropna()

In [26]:
df_na.head()

    record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight
62       63.0      8   19  1977      3.0          DM    M             35.0    40.0
63       64.0      8   19  1977      7.0          DM    M             37.0    48.0
64       65.0      8   19  1977      4.0          DM    F             34.0    29.0
65       66.0      8   19  1977      4.0          DM    F             35.0    46.0
66       67.0      8   19  1977      7.0          DM    M             35.0    36.0
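As a sketch on toy data (made-up values), `dropna()` keeps the original row labels of the surviving rows, which is why the first row above is labeled 62 rather than 0. The `subset` argument, not used in the workshop, restricts the null check to the columns you name.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["M", None, "F", "F"],
    "weight": [np.nan, 40.0, 29.0, 46.0],
})

complete = toy.dropna()                       # null in any column drops the row
has_weight = toy.dropna(subset=["weight"])    # only check the weight column

print(complete.index.tolist())
print(has_weight.index.tolist())
```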


The `to_csv()` DataFrame method exports a DataFrame as a CSV file. You can specify a file name or a file path. `index=False` tells pandas to leave the DataFrame index out of the exported file.

In [27]:
df_na.to_csv('surveys_complete.csv', index=False)
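To see what `index=False` changes, this sketch exports a toy DataFrame (made-up values) to a string instead of a file, which `to_csv()` does when no path is given. It then reads the text back with `read_csv()` to show how an exported index comes back as an extra column.

```python
import io

import pandas as pd

# Toy data with non-default row labels (hypothetical values)
toy = pd.DataFrame(
    {"record_id": [63, 64], "weight": [40.0, 48.0]},
    index=[62, 63],
)

with_index = toy.to_csv()                # index written as a first, unnamed column
without_index = toy.to_csv(index=False)  # only the data columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the indexed export back yields an extra column
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())
```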