# Working with Pandas

Pandas is very useful for working with tabular data in Python. Its main object is the DataFrame, which can be created with `pd.DataFrame()` (assuming Pandas was imported with `import pandas as pd`). In practice, it is more common to load data from a file with `pd.read_csv()` than to build a DataFrame from scratch.
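As a minimal sketch of both routes, the snippet below builds a small DataFrame from a dictionary and then parses the same table from CSV text. The column names and values are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Build a DataFrame from scratch (hypothetical toy data).
from_scratch = pd.DataFrame({"animal": ["moose", "coyote"], "legs": [4, 4]})

# Parse the same table from CSV text; StringIO mimics an opened file.
csv_text = "animal,legs\nmoose,4\ncoyote,4\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_scratch)
print(from_csv)
```

Both objects should hold the same two rows and two columns; only the way they were created differs.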

A lot happens inside `read_csv()`, so it is advisable to run sanity checks to confirm that the data was loaded as expected. The simplest check is to look at the first and last rows with `head()` and `tail()`. You can also check the number and names of the columns.
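These checks might look roughly like the sketch below. It uses a hypothetical toy table rather than the exercise dataset, so the names `toy`, `name` and `value` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy table standing in for a freshly loaded file.
toy = pd.DataFrame({"name": list("abcdef"), "value": range(6)})

print(toy.head(2))        # first two rows
print(toy.tail(2))        # last two rows
print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
```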

## Exercises

Download the dataset of animal species:
```bash
!curl -o species.csv https://raw.githubusercontent.com/MainakRepositor/Datasets/refs/heads/master/species.csv
```

1. Import Pandas. Load the downloaded CSV file with the `read_csv()` function and save its contents in a new variable.
2. Print the first 10 rows. There are two ways to do this: slicing or a Pandas function. Try both.
3. Print the last 10 rows using two different methods.
4. Print 10 random rows using a suitable Pandas function.
5. Print all column names.
6. Print 10 random rows with only these columns: "Category", "Order", "Family", "Scientific Name".

In [2]:
# !pip install pandas




In [3]:
import pandas as pd

In [4]:
raw_data = pd.read_csv('species.csv')


In [7]:
raw_data.head(10)
raw_data.tail(10)

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status,Occurrence,Nativeness,Abundance,Seasonality,Conservation Status,Unnamed: 13
119238,ZION-2786,Zion National Park,Vascular Plant,Solanales,Solanaceae,Physalis hederifolia var. palmeri,Palmer's Ground-Cherry,Approved,Present,Native,Uncommon,,,
119239,ZION-2787,Zion National Park,Vascular Plant,Solanales,Solanaceae,Physalis longifolia,Long-Leaf Ground-Cherry,Approved,Present,Native,Rare,,,
119240,ZION-2788,Zion National Park,Vascular Plant,Solanales,Solanaceae,Solanum elaeagnifolium,Silverleaf Nightshade,Approved,Present,Native,Uncommon,,,
119241,ZION-2789,Zion National Park,Vascular Plant,Solanales,Solanaceae,Solanum nigrum,Black Nightshade,Approved,Present,Not Native,Uncommon,,,
119242,ZION-2790,Zion National Park,Vascular Plant,Solanales,Solanaceae,Solanum sarrachoides,Ground-Cherry Nightshade,Approved,Not Confirmed,Not Native,,,,
119243,ZION-2791,Zion National Park,Vascular Plant,Solanales,Solanaceae,Solanum triflorum,Cut-Leaf Nightshade,Approved,Present,Native,Uncommon,,,
119244,ZION-2792,Zion National Park,Vascular Plant,Vitales,Vitaceae,Vitis arizonica,Canyon Grape,Approved,Present,Native,Uncommon,,,
119245,ZION-2793,Zion National Park,Vascular Plant,Vitales,Vitaceae,Vitis vinifera,Wine Grape,Approved,Present,Not Native,Uncommon,,,
119246,ZION-2794,Zion National Park,Vascular Plant,Zygophyllales,Zygophyllaceae,Larrea tridentata,Creosote Bush,Approved,Present,Native,Rare,,,
119247,ZION-2795,Zion National Park,Vascular Plant,Zygophyllales,Zygophyllaceae,Tribulus terrestris,Puncture Vine,Approved,Present,Not Native,Uncommon,,,


In [9]:
raw_data.iloc[:10, :]
# .loc is safer than raw_data['column name']; its slices include the end label, so this returns rows 0 through 10
raw_data.loc[:10, 'Family']

0       Cervidae
1       Cervidae
2        Canidae
3        Canidae
4        Canidae
5        Felidae
6        Felidae
7     Mephitidae
8     Mustelidae
9     Mustelidae
10    Mustelidae
Name: Family, dtype: object

In [11]:
raw_data.sample(10)

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status,Occurrence,Nativeness,Abundance,Seasonality,Conservation Status,Unnamed: 13
116220,YOSE-1856,Yosemite National Park,Vascular Plant,Brassicales,Brassicaceae,Lepidium virginicum var. virginicum,"Poorman's-Pepperwort, Virginia Pepperweed",Approved,Present,Not Native,Uncommon,,,
70444,KOVA-1695,Kobuk Valley National Park,Fungi,Baeomycetales,Baeomycetaceae,Baeomyces carneus,Cap Lichen,In Review,,,,,,
111548,YELL-1150,Yellowstone National Park,Bird,Charadriiformes,Laridae,Sterna caspia,Caspian Tern,Approved,Present,Native,Rare,Breeder,,
115035,YELL-4637,Yellowstone National Park,Insect,Trichoptera,Rhyacophilidae,Rhyacophila,,In Review,Present,Unknown,Unknown,,,
117949,ZION-1497,Zion National Park,Vascular Plant,Asparagales,Asparagaceae,Yucca utahensis,Utah Yucca,Approved,Present,Native,Common,,,
32097,EVER-1725,Everglades National Park,Fish,Perciformes,Gobiidae,Bathygobius,"Frillfin Gobies, Mapos",In Review,,,,,,
27739,DEVA-2654,Death Valley National Park,Vascular Plant,Caryophyllales,Amaranthaceae,Amaranthus fimbriatus,Fringed Amaranth,Approved,Present,Native,Unknown,,,
15832,CHIS-1974,Channel Islands National Park,Vascular Plant,Caryophyllales,Amaranthaceae,Salsola tragus,"Russian Thistle, Tumbleweed",Approved,Present,Not Native,Uncommon,,,
55017,GUMO-1815,Guadalupe Mountains National Park,Vascular Plant,Caryophyllales,Polygonaceae,Eriogonum abertianum,Albert Wild Buckwheat,Approved,Present,Native,Unknown,,,
105058,SHEN-5124,Shenandoah National Park,Fungi,Boletales,Boletaceae,Leccinum flavostipitatum,,Approved,Present,Unknown,Unknown,,,


In [12]:
cols = ["Category", "Order", "Family"]
raw_data.sample(10)[cols]

Unnamed: 0,Category,Order,Family
104035,Vascular Plant,Sapindales,Anacardiaceae
43918,Vascular Plant,Poales,Cyperaceae
114595,Insect,Lepidoptera,Lycaenidae
70411,Vascular Plant,Saxifragales,Saxifragaceae
103592,Vascular Plant,Poales,Poaceae
10724,Bird,Passeriformes,Fringillidae
12153,Vascular Plant,Apiales,Apiaceae
50122,Insect,Hymenoptera,Aulacidae
50912,Insect,Lepidoptera,Saturniidae
101183,Bird,Passeriformes,Parulidae


# Pandas data types

Each column of a DataFrame has a single data type (dtype) shared by all of its values. Converting a column to another dtype converts every value in it:

In [3]:
df_with_ints = pd.DataFrame([0, 0, 1, 1, 1, 0, 1], columns=['values'])
print(f'Dataframe has one column with values of type: {df_with_ints["values"].dtype}')
print(f'Column values: {df_with_ints["values"].values}\n')

# using astype()

df_with_ints['values_bool'] = df_with_ints['values'].astype(bool)
print(f'By changing column dtype to bool, it got dtype: {df_with_ints["values_bool"].dtype}')
print(f'Column values: {df_with_ints["values_bool"].values}\n')

df_with_ints['values_float'] = df_with_ints['values'].astype(float)
print(f'By changing column dtype to float, it got dtype: {df_with_ints["values_float"].dtype}')
print(f'Column values: {df_with_ints["values_float"].values}\n')

df_with_ints['values_str'] = df_with_ints['values'].astype(str)
print(f'By changing column dtype to str, it got dtype: {df_with_ints["values_str"].dtype}')
print(f'Column values: {df_with_ints["values_str"].values}\n')

# directly assigning a value of an incompatible type (deprecated; recent Pandas versions emit a FutureWarning)

df_with_ints['values_changed'] = df_with_ints['values'].copy()
df_with_ints.loc[0, 'values_changed'] = 'changed'
print(f'By directly changing one value to str, the column got dtype: {df_with_ints["values_changed"].dtype}')
print(f'Column values: {df_with_ints["values_changed"].values}\n')

Dataframe has one column with values of type: int64
Column values: [0 0 1 1 1 0 1]

By changing column dtype to bool, it got dtype: bool
Column values: [False False  True  True  True False  True]

By changing column dtype to float, it got dtype: float64
Column values: [0. 0. 1. 1. 1. 0. 1.]

By changing column dtype to str, it got dtype: object
Column values: ['0' '0' '1' '1' '1' '0' '1']

By directly changing one value to str, the column got dtype: object
Column values: ['changed' 0 1 1 1 0 1]




As with NumPy arrays, a single value can change the data type of the whole column:

In [None]:
#new_file = old_file.copy()

In [4]:
with open('test.csv', 'w') as f:
  f.writelines(["first_col,second_col,third_col\n", "1,2,3\n", "1,2,hello"])

df = pd.read_csv('test.csv')
print(df.dtypes)
df

first_col      int64
second_col     int64
third_col     object
dtype: object


Unnamed: 0,first_col,second_col,third_col
0,1,2,3
1,1,2,hello


If some data types of a loaded DataFrame are not expected, the `unique()` function is useful to see all unique values of a column and understand which one might have affected the data type.
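A minimal sketch of that workflow, using hypothetical CSV text instead of the exercise file: `unique()` lists the candidates, and `pd.to_numeric(..., errors="coerce")` is one possible way (an addition here, not part of the original text) to pinpoint the entries that cannot be read as numbers.

```python
import io

import pandas as pd

# Hypothetical CSV text: one stray string in an otherwise numeric column.
csv_text = "first_col,third_col\n1,3\n1,hello\n1,7\n"
toy = pd.read_csv(io.StringIO(csv_text))

# All distinct values of the suspicious column.
print(toy["third_col"].unique())

# errors="coerce" turns unparseable entries into NaN, so the rows where
# the result is NaN are the ones that prevented a numeric dtype.
as_numbers = pd.to_numeric(toy["third_col"], errors="coerce")
offending = toy.loc[as_numbers.isna(), "third_col"]
print(offending.tolist())
```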

## Exercises

1. Check the data type of each column in the loaded DataFrame.
2. There is a column called "Unnamed: 13". What is inside this column? Print all its unique values.
3. What other column has the same values as this one? Using a `for` loop, print the unique values of each column.
4. Some columns have a lot of unique values. For each column, print only the 5 most frequent values together with their counts, using `value_counts()`.
5. It seems that some values from the "Conservation Status" column leaked into the newly created "Unnamed: 13" column. Count how many non-empty values are in "Unnamed: 13".
6. Print only the rows that have non-empty values in "Unnamed: 13". Try to understand what happened: how can these rows be repaired so that the values match their columns?
7. Save the indices of these rows in a new variable.
8. Load the same file without Pandas, using `open()` and `readlines()`, then print the same problematic rows using only the saved indices. Do you get the same rows?
9. Print the problematic rows using Species ID instead, to ensure that they correspond to the rows found before.

In [23]:
# if markdown changed ctrl+enter
print(raw_data.dtypes)
raw_data['Unnamed: 13'].unique()
raw_data.columns

for col in raw_data.columns:
    print(raw_data[col].unique())

Species ID             object
Park Name              object
Category               object
Order                  object
Family                 object
Scientific Name        object
Common Names           object
Record Status          object
Occurrence             object
Nativeness             object
Abundance              object
Seasonality            object
Conservation Status    object
Unnamed: 13            object
dtype: object
['ACAD-1000' 'ACAD-1001' 'ACAD-1002' ... 'ZION-2793' 'ZION-2794'
 'ZION-2795']
['Acadia National Park' 'Arches National Park' 'Badlands National Park'
 'Big Bend National Park' 'Biscayne National Park'
 'Black Canyon of the Gunnison National Park' 'Bryce Canyon National Park'
 'Canyonlands National Park' 'Capitol Reef National Park'
 'Carlsbad Caverns National Park' 'Channel Islands National Park'
 'Congaree National Park' 'Crater Lake National Park'
 'Cuyahoga Valley National Park' 'Denali National Park and Preserve'
 'Death Valley National Park' 'Dry Tortuga

In [27]:
raw_data[~raw_data['Unnamed: 13'].isna()]

In [28]:
missing_ind = raw_data[raw_data['Unnamed: 13'].notna()].index  # labels of rows with a value in "Unnamed: 13"

In [29]:
raw_data.loc[missing_ind, 'Conservation Status'].isna()

6441      True
31786    False
31826    False
44733    False
44944     True
Name: Conservation Status, dtype: bool

In [30]:
missing_filter = raw_data['Unnamed: 13'].isna()
missing_filter

0         True
1         True
2         True
3         True
4         True
          ... 
119243    True
119244    True
119245    True
119246    True
119247    True
Name: Unnamed: 13, Length: 119248, dtype: bool

In [32]:
missing_filter_12 = raw_data.loc[missing_ind, 'Conservation Status'].isna()

In [None]:
to_replace = raw_data.loc[missing_ind, 'Conservation Status'][raw_data.loc[missing_ind, 'Conservation Status'].isna()].index

In [33]:
raw_data[~missing_filter][missing_filter_12]  # rows where "Unnamed: 13" has a value but "Conservation Status" is empty

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status,Occurrence,Nativeness,Abundance,Seasonality,Conservation Status,Unnamed: 13
6441,BISC-1026,Biscayne National Park,Mammal,Sirenia,Trichechidae,Trichechus manatus,Manatee,Manati,Approved,Present,Unknown,Unknown,,Endangered
44944,GRSA-1347,Great Sand Dunes National Park and Preserve,Vascular Plant,Asparagales,Iridaceae,Iris missouriensis,Blue Flag,Wild Iris,Approved,Present,Native,Rare,,Species of Concern


In [None]:
raw_data.loc[~missing_filter & missing_filter_12, 'Conservation Status'] = raw_data[~missing_filter][missing_filter_12]['Unnamed: 13']

In [None]:
raw_data.loc[~missing_filter & missing_filter_12, 'Conservation Status']

In [35]:
raw_data.drop('Unnamed: 13', axis = 1, inplace = True)

In [36]:
raw_data.head()

Unnamed: 0,Species ID,Park Name,Category,Order,Family,Scientific Name,Common Names,Record Status,Occurrence,Nativeness,Abundance,Seasonality,Conservation Status
0,ACAD-1000,Acadia National Park,Mammal,Artiodactyla,Cervidae,Alces alces,Moose,Approved,Present,Native,Rare,Resident,
1,ACAD-1001,Acadia National Park,Mammal,Artiodactyla,Cervidae,Odocoileus virginianus,"Northern White-Tailed Deer, Virginia Deer, Whi...",Approved,Present,Native,Abundant,,
2,ACAD-1002,Acadia National Park,Mammal,Carnivora,Canidae,Canis latrans,"Coyote, Eastern Coyote",Approved,Present,Not Native,Common,,Species of Concern
3,ACAD-1003,Acadia National Park,Mammal,Carnivora,Canidae,Canis lupus,"Eastern Timber Wolf, Gray Wolf, Timber Wolf",Approved,Not Confirmed,Native,,,Endangered
4,ACAD-1004,Acadia National Park,Mammal,Carnivora,Canidae,Vulpes vulpes,"Black Fox, Cross Fox, Eastern Red Fox, Fox, Re...",Approved,Present,Unknown,Common,Breeder,


In [37]:
changed_rows = ~missing_filter & missing_filter_12
changed_index = raw_data.loc[changed_rows].index

In [40]:
with open ('species.csv', 'r', encoding = 'utf-8') as file:
    list_of_data = []
    for line in file:
        list_of_data.append(line)

In [42]:
for index in changed_index.tolist():
    print(index, list_of_data[index])

6441 BISC-1025,Biscayne National Park,Mammal,Rodentia,Sciuridae,Sciurus carolinensis,Eastern Gray Squirrel,Approved,Present,Unknown,Unknown,,,

44944 GRSA-1335,Great Sand Dunes National Park and Preserve,Vascular Plant,Apiales,Apiaceae,Heracleum maximum,"Common Cowparsnip, Cow Parsnip, Cowparsnip",Approved,Present,Native,Unknown,,,
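The lines printed above carry different Species IDs from the DataFrame rows with the same labels. A DataFrame row label need not coincide with a line number in the file: the header takes up the first line, and a record may occupy more than one line (an assumption about this file, not something verified here). Matching on the Species ID avoids the problem. The sketch below uses hypothetical lines and IDs in place of the real file contents.

```python
# Hypothetical raw lines standing in for the result of readlines().
lines = [
    "Species ID,Park Name,Common Names\n",
    "AAA-1,Park A,Moose\n",
    "AAA-2,Park A,Manatee, Manati\n",  # unquoted comma adds an extra field
    "AAA-3,Park B,Coyote\n",
]
wanted_ids = {"AAA-2"}  # hypothetical IDs of the problematic rows

# Keep the lines whose first comma-separated field is a wanted ID.
matches = [line for line in lines if line.split(",", 1)[0] in wanted_ids]
for line in matches:
    print(line.rstrip("\n"))
```

With the real data, `wanted_ids` could be filled from the "Species ID" column of the problematic rows.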



# Numpy basics: working with NaNs

NumPy is very efficient and therefore serves as the basis for many other packages, including Pandas. It also handles arrays with more than two dimensions more easily. However, its syntax and results are less straightforward, and it has some limitations, mainly that all elements of an array must have the same type. NumPy arrays offer functionality similar to Pandas Series and DataFrames.

An array can be created in a few ways:

- `np.array()` creates an array based on another collection
- `np.zeros()` creates an array of `0` with the specified shape
- `np.arange()` creates an array with values specified similarly to Python's built-in `range()`
- `np.linspace()` creates a specified number of evenly spaced values
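The four constructors listed above can be sketched side by side. The argument values are arbitrary and chosen only for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 3])    # from an existing collection
zeros = np.zeros((2, 3))           # 2 rows and 3 columns, filled with 0.0
steps = np.arange(0, 10, 2)        # start, stop (exclusive), step
spaced = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values, endpoints included

print(from_list.dtype, zeros.shape)
print(steps)
print(spaced)
```

Note that `np.arange()` excludes the stop value, like `range()`, while `np.linspace()` includes it by default.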

Numpy also includes the *not-a-number* value: `np.nan`, which usually indicates a missing value in a dataset context. You can check whether a value is missing using `np.isnan()` (or `pd.isna()` in pandas).
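A short sketch of why a dedicated check is needed: `np.nan` does not compare equal to anything, including itself, so `==` cannot be used to find missing values. The array below is made-up example data.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])  # hypothetical data with one gap

print(np.nan == np.nan)  # False: NaN never compares equal, even to itself

mask = np.isnan(values)  # element-wise check, returns a boolean array
print(mask)
print(pd.isna(values))   # the Pandas counterpart gives the same mask
```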

Missing values can be handled in different ways, depending on what is known about the data, how many values are missing, how important they are, and so on. There is no single rule for deciding what to do with them. Pandas offers two methods: `.dropna()`, which drops missing values, and `.fillna()`, which fills all missing values with a specified value or method. Let's try to replicate this in NumPy:

1. Convert the DataFrame loaded before to a NumPy array.
2. Find all missing values in every column. How many are there?
3. Pick a strategy for handling the missing values in every column based on what you know about it. For the sake of this exercise, your choices are: drop rows, drop columns, fill every value with the previous one, or fill every value with the most common value.
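As a rough sketch of two of these strategies on a small, purely numeric array (hypothetical data, not the species dataset): rows with gaps are dropped, or each gap is filled with the most common value of its column. `np.isnan()` works on float arrays only; for an array of strings converted from a DataFrame, `pd.isna()` would be needed to build the mask instead.

```python
import numpy as np

# Hypothetical numeric data with one gap in each column.
data = np.array([[1.0, np.nan],
                 [2.0, 5.0],
                 [np.nan, 5.0],
                 [2.0, 6.0]])

missing = np.isnan(data)
print(missing.sum(axis=0))  # number of missing values per column

# Counterpart of .dropna(): keep only rows without any missing value.
dropped = data[~missing.any(axis=1)]

# Counterpart of .fillna(): fill each column with its most common value.
filled = data.copy()
for col in range(data.shape[1]):
    present = data[~missing[:, col], col]
    uniques, counts = np.unique(present, return_counts=True)
    filled[missing[:, col], col] = uniques[counts.argmax()]

print(dropped)
print(filled)
```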

In [43]:
import numpy as np

In [44]:
np.array([[1, 12, 3,], [23, 4, 5]])

array([[ 1, 12,  3],
       [23,  4,  5]])

In [47]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [48]:
empty = np.zeros(10)
empty[5] = 15
empty

array([ 0.,  0.,  0.,  0.,  0., 15.,  0.,  0.,  0.,  0.])