# Operations 

## Sections
1. <a href="#functions">UFuncs: Index Alignment in Series</a>
2. <a href="#index">Index alignment in DataFrame</a>
3. <a href="#missing">Handling Missing Data</a>
4. <a href="#read">Read and write CSV and XLS files</a>

<a id = 'functions'/>

## 1. UFuncs: Index Alignment in Series
For binary operations on two Series or DataFrame objects, Pandas aligns the indices
while performing the operation. This is very convenient when you are
working with incomplete data, as the following examples show.

In [3]:
import pandas as pd
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [4]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting Series contains the union of the indices of the two inputs, which we
can compute directly with the union() method of the index objects.

In [5]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
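The index objects support other set-style operations as well. The following is a minimal sketch, reusing the area and population data from above, that shows which labels survive the division and which ones end up as NaN. The variable names common, only_area, and either are illustrative.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Labels present in both Series: these get a real value in population / area
common = area.index.intersection(population.index)

# Labels present only in area: these become NaN in the result
only_area = area.index.difference(population.index)

# Labels present in either Series: this is the index of the result
either = area.index.union(population.index)

density = population / area
print(sorted(common))
print(list(only_area))
print(list(either))
```

Dropping the NaN entries from density leaves exactly the labels in common.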

Indices are matched this way for any of Python's built-in arithmetic operators; entries that are present in only
one of the inputs are filled with NaN by default.

In [6]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If NaN is not the desired result, we can set the fill value by using the
corresponding object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but it also accepts an optional fill_value that stands in
for any element missing from A or B.

In [7]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
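The same pattern applies to the other arithmetic methods, such as sub() and mul(). A short sketch with the same A and B follows. The fill values chosen here (0 for subtraction, 1 for multiplication) are only an illustration, since the right choice depends on the data.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The plain operator leaves NaN wherever a label exists in only one Series
plain = A - B

# sub() behaves like A - B, with 0 standing in for the missing side
difference = A.sub(B, fill_value=0)

# mul() behaves like A * B, with 1 standing in for the missing side
product = A.mul(B, fill_value=1)

print(difference)
print(product)
```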

<a id = 'index'/>

## 2. Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames.

In [8]:
import numpy as np
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A

   A   B
0  6   0
1  3  19


In [9]:
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('BAC'))
B

   B  A  C
0  2  0  9
1  5  6  6
2  1  6  1


In [10]:
A + B

     A     B   C
0  6.0   2.0 NaN
1  9.0  24.0 NaN
2  NaN   NaN NaN


Notice that labels are aligned correctly irrespective of their order in the two objects,
and that the index and columns of the result are sorted. As with Series, we can use the
corresponding arithmetic method and pass a fill_value to use in place
of missing entries. Here we fill with the mean of all values in A, which we compute by first stacking the rows of A.

In [11]:
A.stack()

0  A     6
   B     0
1  A     3
   B    19
dtype: int32

In [12]:
fill = A.stack().mean()
fill

7.0

In [13]:
A.add(B, fill_value=fill)

      A     B     C
0   6.0   2.0  16.0
1   9.0  24.0  13.0
2  13.0   8.0   8.0


<a id = 'missing'/>

## 3. Handling Missing Data

* The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.

* In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. Throughout, we'll refer to missing data in general as null, NaN, or NA values.

Schemes for indicating missing data in a table or DataFrame generally revolve around one of
two strategies: using a mask that globally indicates missing values, or choosing a sentinel
value that indicates a missing entry.

1. Sentinel value (for example, -9999 or NaN)
2. Mask
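The two strategies can be sketched side by side in NumPy. The readings list below is hypothetical data in which -9999 marks a missing entry. The sentinel version swaps that marker for NaN, and the mask version uses np.ma.masked_array to keep a separate Boolean array that flags the missing entries.

```python
import numpy as np

# Hypothetical sensor readings; -9999 marks a missing entry
readings = [12.5, -9999, 14.1, -9999, 13.3]

# Sentinel strategy: store a special value (here NaN) in the data itself
sentinel = np.array(readings, dtype=float)
sentinel[sentinel == -9999] = np.nan
sentinel_mean = np.nanmean(sentinel)      # NaN-aware aggregation

# Mask strategy: keep the data as-is and track missing entries separately
raw = np.array(readings, dtype=float)
masked = np.ma.masked_array(raw, mask=(raw == -9999))
masked_mean = masked.mean()               # masked entries are skipped

print(sentinel_mean, masked_mean, masked.count())
```

Both approaches skip the two missing readings. The mask costs an extra Boolean array, while the sentinel takes one value out of the range the data can represent.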

In [14]:
# None: Python Missing Data
import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects. While this kind of
object array is useful for some purposes, any operations on the data will be done at
the Python level, with much more overhead than the typically fast operations seen for
arrays with native types.

In [15]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
284 ms ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

dtype = int
4.75 ms ± 440 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



The use of Python objects in an array also means that if you perform aggregations
like sum() or min() across an array with a None value, you will generally get an error.

In [16]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [None]:
# NaN missing data
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

In [None]:
vals2.sum()

In [None]:
np.nansum(vals2)
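Since the cells above are shown without output, here is a small standalone sketch of the behavior they illustrate: NaN is a floating-point value, it propagates through ordinary arithmetic and aggregations, and the nan-prefixed NumPy functions ignore it.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])
print(vals2.dtype)                 # float64: NaN forces a floating-point array

# Any arithmetic involving NaN yields NaN
print(1 + np.nan, 0 * np.nan)      # nan nan

# Ordinary aggregations are "infected" by NaN
total = vals2.sum()
print(total)                       # nan

# NaN-aware aggregations skip the missing entry
safe_total = np.nansum(vals2)
safe_min = np.nanmin(vals2)
safe_max = np.nanmax(vals2)
print(safe_total, safe_min, safe_max)   # 8.0 1.0 4.0
```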

### 3.1 NaN and None in Pandas

NaN and None both have their place, and Pandas is built to handle the two of them
nearly interchangeably, converting between them where appropriate.

In [None]:
pd.Series([1, np.nan, 2, None])
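A brief sketch of that conversion: in a numeric Series, None is stored as NaN and the dtype becomes float64. As an aside that the text above does not cover, Pandas also offers the nullable 'Int64' extension dtype, which keeps integers and shows missing entries as <NA>.

```python
import numpy as np
import pandas as pd

# None is converted to NaN, and the integers are upcast to float64
floats = pd.Series([1, np.nan, 2, None])
print(floats.dtype)
print(floats.isna().tolist())

# Optional: the nullable integer dtype keeps integer values alongside missing ones
nullable = pd.Series([1, None, 2], dtype='Int64')
print(nullable)
print(nullable.sum())   # missing entries are skipped by default
```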

### 3.2 Operating on Null Values

In [None]:
# Detecting null values 
data = pd.Series([1, np.nan, 'hello', None])
data

In [None]:
data.isnull()

In [None]:
data[data.notnull()]

In [None]:
# Dropping null values
data.dropna()

In [None]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

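We cannot drop single values from a DataFrame; we can only drop full rows or full
columns. Depending on the application, you might want one or the other, so
dropna() gives a number of options for a DataFrame. By default it drops every row that
contains a null value; axis=1 (or axis='columns') drops columns instead.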

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df[3] = np.nan
print(df)

In [None]:
df.dropna(axis='columns', thresh=3)  # keep only columns with at least 3 non-null values

In [None]:
df.dropna(axis='columns', how='all')  # drop only columns that are entirely null
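Because the cells above are shown without output, the following standalone sketch rebuilds the same DataFrame, including the all-NaN column 3, and records what each dropna() option keeps. The result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan   # a column that is entirely null

# Default (how='any'): every row has a NaN in column 3, so nothing survives
rows_any = df.dropna()

# how='all': only the column that is entirely null is dropped
cols_all = df.dropna(axis='columns', how='all')

# thresh=3: keep columns with at least 3 non-null values (only column 2)
cols_thresh = df.dropna(axis='columns', thresh=3)

# thresh=3 along rows: keep rows with at least 3 non-null values (only row 1)
rows_thresh = df.dropna(axis='index', thresh=3)

print(len(rows_any))
print(list(cols_all.columns))
print(list(cols_thresh.columns))
print(list(rows_thresh.index))
```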

### 3.3 Filling Null Values
Sometimes rather than dropping NA values, you’d rather replace them with a valid
value. This value might be a single number like zero, or it might be some sort of
imputation or interpolation from the good values. You could do this in-place using
the isnull() method as a mask, but because it is such a common operation Pandas
provides the fillna() method, which returns a copy of the array with the null values
replaced.

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
data.fillna(0) # a single value such as zero

In [None]:
data.ffill()  # forward-fill: propagate the previous valid value forward

In [None]:
data.bfill()  # backward-fill: propagate the next valid value backward
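The text above mentions filling through an isnull() mask and filling by imputation or interpolation. A minimal sketch of those options on the same Series follows. The choice of the mean as the imputed value is an assumption made for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Mask-based fill: assign through the Boolean mask returned by isnull()
masked_fill = data.copy()
masked_fill[masked_fill.isnull()] = 0

# Simple imputation: replace nulls with the mean of the valid values
mean_fill = data.fillna(data.mean())

# Linear interpolation between neighboring valid values
interp = data.interpolate()

print(masked_fill.tolist())
print(mean_fill.tolist())
print(interp.tolist())
```

The mask-based version gives the same result as data.fillna(0), which is why fillna() is usually the more convenient choice.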

For DataFrames, the options are similar, but we can also specify an axis along which
the fills take place.

In [None]:
df

In [None]:
df.ffill(axis=1)  # forward-fill along each row; a leading NaN stays NaN

<a id = 'read'/>

## 4. Read and write CSV and XLS files

In [None]:
import pandas as pd
df = pd.read_csv('weather_data.csv')
df

In [None]:
# INSTALL: pip3 install openpyxl  (xlrd 2.0+ reads only legacy .xls files)

#read excel file 
df = pd.read_excel('weather_data.xlsx')
df

In [None]:
# write the DataFrame to CSV; index=False leaves out the row labels
df.to_csv('new.csv') 
df.to_csv('new_noIndex.csv', index=False)

In [None]:
# INSTALL: pip3 install openpyxl

#write DF to Excel
df.to_excel('new.xlsx', sheet_name='weather_data')
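The effect of index=False is easiest to see in a round trip. This sketch uses a small hypothetical DataFrame and an in-memory buffer (io.StringIO) instead of the weather_data.csv file, so it runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the weather data
df = pd.DataFrame({'day': ['1/1/2017', '1/2/2017'],
                   'temperature': [32, 35]})

# to_csv() with no path returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading back a file that includes the index yields an extra 'Unnamed: 0' column
back_with = pd.read_csv(io.StringIO(with_index))

# Without the index, the columns round-trip unchanged
back_without = pd.read_csv(io.StringIO(without_index))

# Alternatively, tell read_csv that the first column is the index
back_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(back_with.columns))
print(list(back_without.columns))
print(list(back_indexed.columns))
```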

### 4.1 Group By

In [None]:
import pandas as pd
df = pd.read_csv('weather_data_cities.csv')
df #weather by cities

In [None]:
g = df.groupby('city')
g

In [None]:
for city, city_df in g:
    print(city)
    print(city_df)

In [None]:
#or to get specific group
g.get_group('new york')


In [None]:
#Find maximum temperature in each of the cities
print(g.max())

In [None]:
# Average of the numeric columns per city; numeric_only=True skips text columns
print(g.mean(numeric_only=True))


In [None]:
print(g.describe())
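Since weather_data_cities.csv is not included here, the following sketch reproduces the group-by steps on a small hypothetical DataFrame. The column names temperature and windspeed are assumptions about that file, and the agg() call with named aggregations is one possible way to compute several per-city statistics at once.

```python
import pandas as pd

# Hypothetical stand-in for weather_data_cities.csv
df = pd.DataFrame({
    'city': ['new york', 'new york', 'mumbai', 'mumbai'],
    'temperature': [32, 36, 90, 85],
    'windspeed': [6, 7, 5, 12],
})

g = df.groupby('city')

# Maximum temperature in each city
max_temp = g['temperature'].max()

# Several statistics in one call, using named aggregations
summary = g.agg(max_temp=('temperature', 'max'),
                mean_wind=('windspeed', 'mean'))

print(max_temp)
print(summary)
```

The group keys become the index of the result, sorted alphabetically by default.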

### 4.2 Concatenate DataFrames

In [None]:
import pandas as pd
india_weather = pd.DataFrame({
    "city": ["mumbai","delhi","banglore"],
    "temperature": [32,45,30],
    "humidity": [80, 60, 78]
})

india_weather

In [None]:
us_weather = pd.DataFrame({
    "city": ["new york","chicago","orlando"],
    "temperature": [21,14,35],
    "humidity": [68, 65, 75]
})
us_weather

In [None]:
# concatenate two DataFrames; each keeps its original row labels
df = pd.concat([india_weather, us_weather])
df

In [None]:
# pass ignore_index=True to get a continuous index
df = pd.concat([india_weather, us_weather], ignore_index=True)
df

In [None]:
# axis=1 places the DataFrames side by side, aligned on the row index
df = pd.concat([india_weather, us_weather], axis=1)
df
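A short sketch of how the row labels behave under concatenation, using trimmed-down versions of the two DataFrames above. The keys argument, which the cells above do not use, is one optional way to record which source each row came from.

```python
import pandas as pd

india_weather = pd.DataFrame({'city': ['mumbai', 'delhi'],
                              'temperature': [32, 45]})
us_weather = pd.DataFrame({'city': ['new york', 'chicago'],
                           'temperature': [21, 14]})

# Default: the original row labels are kept, so they repeat
stacked = pd.concat([india_weather, us_weather])

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([india_weather, us_weather], ignore_index=True)

# keys adds an outer index level naming the source of each row
labeled = pd.concat([india_weather, us_weather], keys=['india', 'us'])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labeled.loc['us'])
```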

### 4.3 Merge DataFrames

In [None]:
temperature_df = pd.DataFrame({
    "city": ["mumbai","delhi","banglore", 'hyderabad'],
    "temperature": [32,45,30,40]})
temperature_df

In [None]:
humidity_df = pd.DataFrame({
    "city": ["delhi","mumbai","banglore"],
    "humidity": [68, 65, 75]})
humidity_df

In [None]:
# merge two DataFrames on the 'city' column; the default is an inner join
df = pd.merge(temperature_df, humidity_df, on='city')
df

In [None]:
# outer join: keep cities that appear in either DataFrame
df = pd.merge(temperature_df, humidity_df, on='city', how='outer')
df
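To make the difference between the join types concrete, this sketch rebuilds the two DataFrames and compares an inner join with a left join. The indicator=True argument is an optional extra that adds a _merge column recording where each row was found.

```python
import pandas as pd

temperature_df = pd.DataFrame({
    'city': ['mumbai', 'delhi', 'banglore', 'hyderabad'],
    'temperature': [32, 45, 30, 40]})
humidity_df = pd.DataFrame({
    'city': ['delhi', 'mumbai', 'banglore'],
    'humidity': [68, 65, 75]})

# Inner join (default): only cities present in both DataFrames
inner = pd.merge(temperature_df, humidity_df, on='city')

# Left join: every city from temperature_df; missing humidity becomes NaN
left = pd.merge(temperature_df, humidity_df, on='city',
                how='left', indicator=True)

print(inner)
print(left)
```

Here hyderabad has no humidity reading, so it is absent from the inner join and appears with NaN and a _merge value of left_only in the left join.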