# EDS220 Assignment Week 1: Pandas Series and Data Frames

From this [link](https://meds-eds-220.github.io/MEDS-eds-220-course/book/chapters/lesson-2-series-dataframes.html) 


In [5]:
#importing libraries
import pandas as pd
import numpy as np

# Core Object: Series 

A **series** is a one-dimensional array of indexed data.

In [5]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[0.72335368 0.97920494 0.13363149 0.8532242 ] 

<class 'pandas.core.series.Series'>
0    0.723354
1    0.979205
2    0.133631
3    0.853224
dtype: float64


A pandas series has an index as part of the data structure, whereas a NumPy array does not (even though a NumPy array can still be indexed by position).
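To make the difference concrete, here is a minimal sketch. The values and the labels `a`, `b`, `c` are made up for illustration and are not from the lesson.

```python
import numpy as np
import pandas as pd

# Illustrative values and labels (not from the lesson)
arr = np.array([0.5, 1.5, 2.5])
s = pd.Series(arr, index=["a", "b", "c"])

print(s.index)   # the labels stored as part of the series
print(s.values)  # the underlying data as a NumPy array
print(s["b"])    # access by label (series only)
print(arr[1])    # access by position (NumPy array)
```

Both lookups return the same value here, but only the series carries its own labels.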

### Creating a Pandas Series

Basic method to create a pandas series: 

`s = pd.Series(data, index=index)`

The "data" can be: 
- a list or NumPy array
- a dictionary
- a single number
- a boolean
- a string

The index is optional. If it is left out, pandas assigns a default integer index that starts at 0, which matches Python's zero-based indexing.
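A quick way to check what pandas does when no index is passed is to look at the `index` attribute. This is a small sketch with made-up values:

```python
import pandas as pd

# No index argument: pandas builds a default one
default_s = pd.Series([10, 20, 30])

print(default_s.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(default_s.index))  # [0, 1, 2]
```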

In [6]:
# Creating a pandas series from a NumPy array
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [7]:
# Creating a pandas series from a list with the default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

In [8]:
# Creating a pandas series from a dictionary

d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
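The `dtype: object` above appears because the dictionary mixes integers with the string `'3'`. As a rough sketch of the contrast, the second series below (`ints`, a hypothetical all-integer variant that is not in the lesson) gets an integer dtype instead:

```python
import pandas as pd

# Same dictionary as above: 'key_1' holds a string, not a number
mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})

# Hypothetical variant where every value is an integer
ints = pd.Series({'key_0': 2, 'key_1': 3, 'key_2': 5})

print(mixed.dtype)  # object
print(ints.dtype)   # an integer dtype
print([type(v).__name__ for v in mixed])  # each element keeps its own Python type
```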

In [9]:
# Creating a pandas series from a single value
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

### Simple Operations/Arithmetic 

Arithmetic operations are applied element by element to a series, and most NumPy functions accept a series as input.

In [10]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


Comparing a series to a value creates a new `pandas.Series` of `True`/`False` values.

A simple condition like this will be useful when selecting data from data frames.

In [13]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
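As a sketch of why this is useful, the boolean series can be placed inside square brackets to keep only the elements where the condition is `True`. The variable name `mask` is just a label chosen for this example:

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the value is above 70
mask = s > 70

# Keep only the elements where mask is True
print(s[mask])
```

The same pattern carries over to selecting rows of a data frame.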

### Identifying Missing Values

pandas represents missing values (nulls and NAs) with the float value `numpy.nan`.
- NaN stands for "not a number"


In [11]:
# Making a series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [12]:
# Checking whether the series has NAs; returns True if it has any
s.hasnans

True

In [13]:
# Figuring out which elements are NAs

s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
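Once the missing values are located, a common next step is to count them, fill them, or drop them. The sketch below uses `fillna` and `dropna`, which go a little beyond the lesson code above, and the fill value of 0 is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1, so summing the boolean series counts the NAs
n_missing = s.isna().sum()

# Replace NAs with a chosen value (0 is only an example)
filled = s.fillna(0)

# Or remove the NAs entirely
dropped = s.dropna()

print(n_missing)
print(filled)
print(dropped)
```

Both `fillna` and `dropna` return new series and leave `s` unchanged.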

# Check-in 1

In [6]:
# Creating a series with four values, two of which are -999, and an A-D index
s = pd.Series([1, 2, -999, -999], index = ['A', 'B', 'C', 'D'])

In [15]:
print(s)

A      1
B      2
C   -999
D   -999
dtype: int64


In [9]:
# Using the mask method to turn the -999s into NaNs.
# mask returns a new series, so reassign it to s to keep the change
s = s.mask(s == -999)
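The result of the masking step is not printed above, so here is a short sketch that shows it. It also tries `replace` as one possible alternative, which is my own addition and not part of the check-in:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, -999, -999], index=['A', 'B', 'C', 'D'])

# mask() swaps in NaN wherever the condition is True
masked = s.mask(s == -999)

# Alternative (not from the lesson): replace the value directly
replaced = s.replace(-999, np.nan)

print(masked)
print(masked.isna().tolist())
print(replaced.isna().tolist())
```

Because NaN is a float, the masked series is stored as `float64` even though the original values were integers.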

## Creating a pandas.DataFrame

In [42]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame from the dictionaries
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3


In [43]:
# Changing the index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


In [44]:
# Looking at the column labels
df.columns

Index(['col_name_1', 'col_name_2'], dtype='object')

# Check-in 2

In [45]:
# Renaming the columns using the .columns attribute
df.columns = ['C1', 'C2']

In [46]:
# Looking at the column labels
df.columns

Index(['C1', 'C2'], dtype='object')
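Assigning to `df.columns` changes the data frame in place and needs one name for every column. As a hedged alternative that is not part of the check-in, `rename` maps old names to new ones and returns a new data frame:

```python
import numpy as np
import pandas as pd

# Rebuild the original data frame so the example is self-contained
df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})

# Map old column names to new ones; df itself is not modified
renamed = df.rename(columns={'col_name_1': 'C1', 'col_name_2': 'C2'})

print(renamed.columns.tolist())  # ['C1', 'C2']
print(df.columns.tolist())       # ['col_name_1', 'col_name_2']
```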

# Summary of Lesson

The index is the main difference between a pandas series and a NumPy array: in a pandas series, the index is part of the data structure. During this lesson I learned how to create a `pandas.Series` from a NumPy array, a list, a dictionary, and a single value. I also learned how to find missing values (NAs/NaNs) and how to replace placeholder values such as -999 with NaN. Finally, I learned about the importance of data frames and how to rename columns with the `columns` attribute.