# Pandas is organized



Pandas is another commonly used Python module that helps us organize and manipulate data. It builds on NumPy arrays to provide Series and DataFrames, which can be thought of as feature-rich one-dimensional and two-dimensional arrays, respectively. Let's look at some useful features.

## Series

A Series can be thought of as a one-dimensional NumPy array that also carries an index for the data it holds. NumPy arrays have an index too, but it is implicit (just the integer position), whereas a pandas index is explicit and shown alongside the values.

In [1]:
import numpy as np
import pandas as pd 

pd_series = pd.Series([1, 2, 4, 8, 16, 32])
print(f"pd_series\n{pd_series}")


pd_series
0     1
1     2
2     4
3     8
4    16
5    32
dtype: int64


In [2]:
# We can also grab the series indices and values

print(f"Indices: {pd_series.index}")
print(f"Values: {pd_series.values}")

Indices: RangeIndex(start=0, stop=6, step=1)
Values: [ 1  2  4  8 16 32]
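To make the implicit versus explicit index distinction concrete, here is a small sketch that builds the same data as a NumPy array and as a Series. The variable names `np_array` and `as_array` are only illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, 4, 8, 16, 32]
np_array = np.array(values)
pd_series = pd.Series(values)

# A NumPy array is addressed by position only; there is no index object to inspect.
print(np_array[3])

# A Series carries an explicit index object alongside its values.
print(pd_series.index)
print(list(pd_series.index))

# The values can be pulled back out as a plain NumPy array.
as_array = pd_series.to_numpy()
print(type(as_array))
```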


In [3]:
# We can also index and slice them like Python lists
print(pd_series[3])
print("-"*8)
print(pd_series[3:])

8
--------
3     8
4    16
5    32
dtype: int64


In [4]:
# We can change their indexes too! 

pd_series_new_index = pd.Series([1, 2, 4, 8, 16, 32], index=["2^0","2^1","2^2","2^3","2^4","2^5"])
print(f"pd_series_new_index:\n{pd_series_new_index}")

pd_series_new_index:
2^0     1
2^1     2
2^2     4
2^3     8
2^4    16
2^5    32
dtype: int64


In [5]:
# We can slice by the new indices too!
print(pd_series_new_index["2^3"])
print("-"*8)
print(pd_series_new_index["2^3":])

8
--------
2^3     8
2^4    16
2^5    32
dtype: int64


You may have noticed that this looks like a special kind of dictionary. The pandas developers thought so too, so a Series can be built directly from a Python dictionary.

In [6]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
pd_population = pd.Series(population_dict)
print(f"pd_population\n{pd_population}")

pd_population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [7]:
# You can slice this too! 

print(pd_population['California':'Illinois'])


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
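One detail worth noticing in the slice above is that `'Illinois'` was included in the result. As a rough sketch, the example below compares a label-based slice, which includes its end label, with a position-based slice through `.iloc`, which excludes its end position like a Python list. The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

pd_population = pd.Series({'California': 38332521,
                           'Texas': 26448193,
                           'New York': 19651127,
                           'Florida': 19552860,
                           'Illinois': 12882135})

# Slicing by label includes the end label ('New York' is kept).
by_label = pd_population['California':'New York']
print(by_label)

# Slicing by position excludes the end position (position 2 is dropped).
by_position = pd_population.iloc[0:2]
print(by_position)
```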


## DataFrame Object

Series are a good introduction to pandas, but DataFrames are often more useful, especially when we are working with several pieces of data that relate to one another. We can either make a DataFrame directly or build one from Series. Let's start by building one from Series.

In [8]:
# Let's create another series called area
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
pd_area = pd.Series(area_dict)
print(f"pd_area\n{pd_area}")

pd_area
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64


In [9]:
# Then we can combine our series together to create a dataframe. Since they share the same indices they fit together nicely. 
pd_states = pd.DataFrame({"population": pd_population, "area": pd_area}) # Giving each series a column name

print(pd_states)


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
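For comparison, the same table can also be made directly, without building the Series first. This is a minimal sketch that passes a dictionary of column lists plus an explicit `index`; the name `pd_states_direct` is illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column; the index argument labels the rows.
pd_states_direct = pd.DataFrame(
    {
        "population": [38332521, 26448193, 19651127, 19552860, 12882135],
        "area": [423967, 695662, 141297, 170312, 149995],
    },
    index=["California", "Texas", "New York", "Florida", "Illinois"],
)

print(pd_states_direct)
```

Building from Series aligns the rows by index label automatically, while building from lists relies on the lists being in the same order as `index`.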


In [10]:
# Now we can access data based on the row, column, or both! 
print(pd_states["area"])
print(pd_states["population"]["California":"Florida"]) # Note this may be a bit more unintuitive since we usually reference row then column

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Name: population, dtype: int64


In [11]:
# It may be a bit more intuitive to use the .iloc and .loc indexers

# iloc selects by integer position, regardless of what the index labels are
print(pd_states.iloc[0])
print("-"*8)
# While loc selects by index label, which suits labeled or categorized data
print(pd_states.loc["California"])



population    38332521
area            423967
Name: California, dtype: int64
--------
population    38332521
area            423967
Name: California, dtype: int64
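Both indexers also accept a row selector and a column selector together, which gives the row-then-column order mentioned earlier. The sketch below rebuilds a small version of `pd_states` so it runs on its own; the result names are illustrative.

```python
import pandas as pd

pd_states = pd.DataFrame(
    {
        "population": [38332521, 26448193, 19651127],
        "area": [423967, 695662, 141297],
    },
    index=["California", "Texas", "New York"],
)

# Row label first, then column label.
texas_area_by_label = pd_states.loc["Texas", "area"]
print(texas_area_by_label)

# Row position first, then column position (same cell as above).
texas_area_by_position = pd_states.iloc[1, 1]
print(texas_area_by_position)

# A label slice of rows combined with a list of columns returns a DataFrame.
first_two_population = pd_states.loc["California":"Texas", ["population"]]
print(first_two_population)
```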


## Manipulating Data

Pandas offers many ways to manipulate data, and which ones are beneficial depends on your specific data. Here we will go over a few manipulations. You can find the API reference here: https://pandas.pydata.org/docs/reference/index.html

In [12]:
# Transposing the data
print("Transpose")
print(pd_states.T)
print()
# Sorting Indices
print("Sort Axis 0")
print(pd_states.sort_index(axis=0))
print()
# Copying Dataframe
pd_states_deep_copy = pd_states.copy(deep=True) # A deep copy separates the copy from the original
print("Copy")
print(pd_states_deep_copy)
print()





Transpose
            California     Texas  New York   Florida  Illinois
population    38332521  26448193  19651127  19552860  12882135
area            423967    695662    141297    170312    149995

Sort Axis 0
            population    area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662

Copy
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
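To illustrate what "separates the copy from the original" means, here is a short sketch that edits a deep copy and checks that the original is untouched. The small frame and the names `original` and `independent` are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"area": [423967, 695662]},
                        index=["California", "Texas"])

# deep=True copies the data, so the two frames no longer share values.
independent = original.copy(deep=True)
independent.loc["Texas", "area"] = 0

print(original)     # Texas keeps its original area
print(independent)  # Texas has been overwritten in the copy only
```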



In [13]:
#Dropping Index or Column
print("Dropping")
pd_states_deep_copy = pd_states_deep_copy.drop(columns=["population"])
print(pd_states_deep_copy)
print("-" * 8)
pd_states_deep_copy = pd_states_deep_copy.drop(["California", "New York"])
print(pd_states_deep_copy)
print()

Dropping
              area
California  423967
Texas       695662
New York    141297
Florida     170312
Illinois    149995
--------
            area
Texas     695662
Florida   170312
Illinois  149995



Your data may also have features or values that you want to filter out. You can filter these by using a conditional expression with the .where and .mask methods.

In [14]:
# .where replaces values where the condition is false
some_series = pd.Series(range(-2, 10))
print(some_series)
invalid_neg_series = some_series.where(some_series>0, None)
print(invalid_neg_series)


0    -2
1    -1
2     0
3     1
4     2
5     3
6     4
7     5
8     6
9     7
10    8
11    9
dtype: int64
0     NaN
1     NaN
2     NaN
3     1.0
4     2.0
5     3.0
6     4.0
7     5.0
8     6.0
9     7.0
10    8.0
11    9.0
dtype: float64


In [15]:
# .mask replaces values where the condition is true
print(some_series)
mask_invalid_neg_series = some_series.mask(some_series<0, None)
print(mask_invalid_neg_series)

0    -2
1    -1
2     0
3     1
4     2
5     3
6     4
7     5
8     6
9     7
10    8
11    9
dtype: int64
0     NaN
1     NaN
2     0.0
3     1.0
4     2.0
5     3.0
6     4.0
7     5.0
8     6.0
9     7.0
10    8.0
11    9.0
dtype: float64
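The two methods are mirror images of each other: `.where` keeps values where the condition is true, and `.mask` hides values where it is true. As a small sketch, inverting the condition with `~` makes them produce the same result. The names `condition`, `kept_with_where`, and `kept_with_mask` are illustrative.

```python
import pandas as pd

some_series = pd.Series(range(-2, 10))
condition = some_series > 0

# Keep values where the condition holds; everything else becomes NaN.
kept_with_where = some_series.where(condition)

# Hide values where the inverted condition holds; same effect as above.
kept_with_mask = some_series.mask(~condition)

print(kept_with_where.equals(kept_with_mask))
```

Note that the two cells above used different conditions (`> 0` and `< 0`), which is why the value 0 was replaced in one result and kept in the other.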


Real-world data often has missing values, which pandas usually represents as NaN, as shown above. NaN is a special floating-point value rather than Python's None, but pandas treats both as missing data and converts None to NaN in numeric columns.
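A quick sketch of how NaN and None relate may help here. It only uses `np.nan`, `pd.isna`, and a small made-up Series named `converted`.

```python
import numpy as np
import pandas as pd

# NaN is a float, and it is not equal to anything, including itself.
print(type(np.nan))
print(np.nan == np.nan)

# pandas reports both NaN and None as missing.
print(pd.isna(np.nan), pd.isna(None))

# In a numeric Series, None is stored as NaN and the dtype becomes float.
converted = pd.Series([1, None, 3])
print(converted)
```

This is also why the integer Series above turned into `float64` once some of its values were replaced.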

In [16]:
# Let's create a dataframe that shows which cats have a rabies shot.
# We can pretend this data was collected over the phone and some owners did not answer, so we marked them as missing (None)
cat_info_data_dict = {
    "Cat 1": [12, None],
    "Cat 2": [11, True],
    "Cat 3": [10, True],
    "Cat 4": [8, None],
    "Cat 5": [7, False],
    "Cat 6": [10, True]
}

cat_info = pd.DataFrame.from_dict(cat_info_data_dict).T
cat_info.columns=["Weight", "Rabies Shot"]
print(cat_info)



      Weight Rabies Shot
Cat 1   12.0         NaN
Cat 2     11        True
Cat 3     10        True
Cat 4    8.0         NaN
Cat 5      7       False
Cat 6     10        True


In [17]:
# We can actually use the .mask method from above, but this is so common that pandas has built in methods just for dealing with missing data
dropped_na_cat_info = cat_info.dropna()
print(dropped_na_cat_info)

      Weight Rabies Shot
Cat 2     11        True
Cat 3     10        True
Cat 5      7       False
Cat 6     10        True


In [18]:
# Or we can fill it with a different value (maybe here we want to see a conservative estimate of the cats with rabies shots)
filled_na_cat_info = cat_info.fillna(False)
print(filled_na_cat_info)

       Weight  Rabies Shot
Cat 1    12.0        False
Cat 2    11.0         True
Cat 3    10.0         True
Cat 4     8.0        False
Cat 5     7.0        False
Cat 6    10.0         True




In [19]:
# We can also build a boolean mask that marks which values are missing
mask_na_cat_info = cat_info.isnull()
print(mask_na_cat_info)

       Weight  Rabies Shot
Cat 1   False         True
Cat 2   False        False
Cat 3   False        False
Cat 4   False         True
Cat 5   False        False
Cat 6   False        False
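A boolean mask like this is mostly useful as an input to something else. As a hedged sketch, the example below rebuilds the cat data column by column (a slightly different construction from the cell above, chosen so the block runs on its own), counts the missing values per column, and selects the rows that have any missing value. The names `missing_per_column` and `rows_with_missing` are illustrative.

```python
import pandas as pd

cat_info = pd.DataFrame(
    {
        "Weight": [12, 11, 10, 8, 7, 10],
        "Rabies Shot": [None, True, True, None, False, True],
    },
    index=[f"Cat {i}" for i in range(1, 7)],
)

# True counts as 1 when summed, so this gives missing values per column.
missing_per_column = cat_info.isnull().sum()
print(missing_per_column)

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = cat_info[cat_info.isnull().any(axis=1)]
print(rows_with_missing)
```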


## Reading and Writing to a File

You will most likely get your data from other modules or from files, and you may want to save your results to a file. Let's explore how we might do that using pandas.


In [20]:
# Let's try writing to a csv file
exp_base_array = np.arange(0, 50, 1).reshape(50, 1)
exp_transformed_array = exp_base_array ** 2
combined_array = np.concatenate((exp_base_array, exp_transformed_array), axis=1)
pd_exp = pd.DataFrame(combined_array)
pd_exp.columns=["x", "x^2"]
pd_exp = pd_exp.set_index("x") # We must set the index otherwise pandas will make an index column for us and save that index column to a file
print(pd_exp)

pd_exp.to_csv("pd_exp.csv")

     x^2
x       
0      0
1      1
2      4
3      9
4     16
5     25
6     36
7     49
8     64
9     81
10   100
11   121
12   144
13   169
14   196
15   225
16   256
17   289
18   324
19   361
20   400
21   441
22   484
23   529
24   576
25   625
26   676
27   729
28   784
29   841
30   900
31   961
32  1024
33  1089
34  1156
35  1225
36  1296
37  1369
38  1444
39  1521
40  1600
41  1681
42  1764
43  1849
44  1936
45  2025
46  2116
47  2209
48  2304
49  2401


In [21]:
# Now let's read that info back

pd_read_exp = pd.read_csv("pd_exp.csv", index_col=0) # We must set the index column as 0 otherwise pandas makes a new index column for us
print(pd_read_exp)

     x^2
x       
0      0
1      1
2      4
3      9
4     16
5     25
6     36
7     49
8     64
9     81
10   100
11   121
12   144
13   169
14   196
15   225
16   256
17   289
18   324
19   361
20   400
21   441
22   484
23   529
24   576
25   625
26   676
27   729
28   784
29   841
30   900
31   961
32  1024
33  1089
34  1156
35  1225
36  1296
37  1369
38  1444
39  1521
40  1600
41  1681
42  1764
43  1849
44  1936
45  2025
46  2116
47  2209
48  2304
49  2401
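To see what the `index_col=0` argument changes, here is a small round-trip sketch. It writes to an in-memory `io.StringIO` buffer instead of a file on disk (an assumption made only to keep the example self-contained) and uses a shortened five-row table.

```python
import io
import pandas as pd

pd_exp = pd.DataFrame({"x": range(5),
                       "x^2": [x ** 2 for x in range(5)]}).set_index("x")

# Write the CSV text into a buffer rather than a file.
buffer = io.StringIO()
pd_exp.to_csv(buffer)

# Reading with index_col=0 restores "x" as the index.
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)
print(with_index)

# Reading without it keeps "x" as a normal column and adds a fresh 0..n-1 index.
buffer.seek(0)
without_index = pd.read_csv(buffer)
print(without_index)
```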


References: https://jakevdp.github.io/PythonDataScienceHandbook/#:~:text=3.%20Data%20Manipulation%20with%20Pandas%C2%B6

Pandas Website: https://pandas.pydata.org/docs/index.html
Pandas API: https://pandas.pydata.org/docs/reference/index.html

Topics to Explore:
- Combining Datasets
- Aggregating and Grouping
- Working with time series
