# Pandas Foundations

- [How to set options to visualize more columns and rows](#section_id1)
- [How to read a data frame and see the head and the tail](#section_id2)
- [Data frame attributes: index, columns, values](#section_id3)
- [Data types, value counts, info](#section_id4)
- [Different ways to select just one column or specific rows](#section_id5)
- [Series methods: size, shape, sample, value_counts, count, unique, quantile, max, min, mean, median, std, describe, isna, fillna, dropna](#section_id6)
- [Series operations: addition, division, ...](#section_id7)
- [Chaining Series methods](#section_id8)
- [Renaming columns](#section_id9)
- [Creating and deleting columns: inserting at a specific position](#section_id10)

In [1]:
import pandas as pd
import numpy as np

### Setting options to visualize more columns and rows
<a id='section_id1'></a>

In [3]:
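# Use the full option names: the short form 'max_columns' can match more than
# one option in recent pandas versions and raise an OptionError.
pd.set_option('display.max_columns', 4, 'display.max_rows', 10)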
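
Display options set this way stay in effect for the whole session. As a minimal sketch, assuming you only want a wider view for a single cell, `pd.option_context` applies an option temporarily and `pd.get_option` reads the current value. The variable names below are illustrative only.

```python
import pandas as pd

# Set the session-wide display limits using the full option names.
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 10)
current_cols = pd.get_option('display.max_columns')

# option_context changes an option only inside the `with` block.
# None means "no limit" for display.max_columns.
with pd.option_context('display.max_columns', None):
    inside_cols = pd.get_option('display.max_columns')

# Outside the block the session-wide value is back in effect.
after_cols = pd.get_option('display.max_columns')
```

If the session-wide values are no longer wanted, `pd.reset_option('display.max_columns')` restores the library default.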

### Reading the CSV file into a DataFrame
#### Use the .head() method to view the first rows (five by default); .tail() shows the last rows
<a id='section_id2'></a>

In [3]:
movies = pd.read_csv('../data/movie.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
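
Both `.head()` and `.tail()` take an optional row count. The sketch below uses a small made-up frame (not the movie data) to show the row count argument, including a negative value, which excludes rows counted from the other end.

```python
import pandas as pd

# Hypothetical stand-in for the movie data; the values are made up.
toy = pd.DataFrame({'title': list('abcdefgh'), 'score': range(8)})

first_three = toy.head(3)         # rows 0, 1, 2
last_two = toy.tail(2)            # rows 6, 7
all_but_last_two = toy.head(-2)   # negative n: everything except the last 2 rows
```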


## DataFrame Attributes 
<a id='section_id3'></a>

pandas uses NaN (not a number) to represent missing values. Notice that even though the
color column has string values, it uses NaN to represent a missing value.
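
A small sketch of what this means in practice, using a made-up Series rather than the real `color` column: NaN never compares equal to anything, including itself, so missing values are detected with `.isna()` instead of `==`.

```python
import numpy as np
import pandas as pd

# A string column with one missing entry (illustrative values).
colors = pd.Series(['Color', np.nan, 'Black and White'])

nan_equals_itself = np.nan == np.nan   # NaN is not equal to itself
missing_mask = colors.isna()           # element-wise test for missing values
n_missing = int(missing_mask.sum())    # True counts as 1 when summed
```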

### Accessing the columns, index, and values

In [6]:
columns = movies.columns
index = movies.index
data = movies.values

### We can use the .values attribute or the .to_numpy() method to get a numpy.ndarray

In [10]:
type(index)

pandas.core.indexes.range.RangeIndex

In [20]:
type(index.values)

numpy.ndarray

In [22]:
index.to_numpy()

array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)

In [11]:
type(columns)

pandas.core.indexes.base.Index

In [19]:
type(columns.values) 

numpy.ndarray

In [26]:
type(columns.to_numpy())

numpy.ndarray

In [12]:
type(data)

numpy.ndarray

In [13]:
issubclass(pd.RangeIndex, pd.Index)

True

## Understanding data types 
<a id='section_id4'></a>

float – The NumPy float type, which supports missing values <br>
int – The NumPy integer type, which does not support missing values <br>
'Int64' – pandas nullable integer type <br>
object – The NumPy type for storing strings (and mixed types)  <br>
'category' – pandas categorical type, which does support missing values  <br>
bool – The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)  <br>
'boolean' – pandas nullable Boolean type  <br>
datetime64[ns] – The NumPy date type, which does support missing values (NaT)  
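
The list above can be checked on tiny hand-made Series. This is only a sketch with illustrative values. It shows how each type holds a missing entry, and how casting to the NumPy `bool` type silently turns `None` and `np.nan` into real values.

```python
import numpy as np
import pandas as pd

as_float = pd.Series([1, np.nan, 3])                       # NumPy float64, NaN marks the gap
as_nullable_int = pd.Series([1, None, 3], dtype='Int64')   # pandas nullable integer, <NA> marks the gap
as_nullable_bool = pd.Series([True, None, False], dtype='boolean')
as_category = pd.Series(['a', None, 'b'], dtype='category')
as_datetime = pd.to_datetime(pd.Series(['2020-01-01', None]))  # NaT marks the gap

# Casting to NumPy bool loses the missing marker entirely.
none_as_bool = bool(np.array([None]).astype(bool)[0])     # None becomes False
nan_as_bool = bool(np.array([np.nan]).astype(bool)[0])    # np.nan becomes True
```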

### Accessing the data types

In [28]:
movies.dtypes

color                       object
director_name               object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
                            ...   
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes         int64
Length: 28, dtype: object

In [30]:
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

In [31]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   director_name              4814 non-null   object 
 2   num_critic_for_reviews     4867 non-null   float64
 3   duration                   4901 non-null   float64
 4   director_facebook_likes    4814 non-null   float64
 5   actor_3_facebook_likes     4893 non-null   float64
 6   actor_2_name               4903 non-null   object 
 7   actor_1_facebook_likes     4909 non-null   float64
 8   gross                      4054 non-null   float64
 9   genres                     4916 non-null   object 
 10  actor_1_name               4909 non-null   object 
 11  movie_title                4916 non-null   object 
 12  num_voted_users            4916 non-null   int64  
 13  cast_total_facebook_likes  4916 non-null   int64

### How it works...

In [32]:
pd.Series(['Paul', np.nan, 'George']).dtype

dtype('O')

### Selecting only one column
<a id='section_id5'></a>

In [None]:
movies['director_name']

In [None]:
movies.director_name

In [None]:
movies.loc[:, 'director_name']

In [None]:
movies.iloc[:, 1]
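
The four selections above return the same Series. The sketch below checks this on a small made-up frame and also shows one caveat of attribute access: it fails silently when a column name clashes with a DataFrame method. The `count` column here is hypothetical and does not exist in the movie data.

```python
import pandas as pd

# Hypothetical frame; 'count' is chosen on purpose to clash with DataFrame.count.
toy = pd.DataFrame({'color': ['Color', 'Color'],
                    'director_name': ['A. Director', 'B. Director'],
                    'count': [1, 2]})

by_brackets = toy['director_name']
by_attribute = toy.director_name
by_label = toy.loc[:, 'director_name']
by_position = toy.iloc[:, 1]

all_equal = (by_brackets.equals(by_attribute)
             and by_brackets.equals(by_label)
             and by_brackets.equals(by_position))

# toy.count is the DataFrame.count method, not the 'count' column,
# so bracket selection is the safer general-purpose form.
count_is_method = callable(toy.count)
count_column = toy['count']
```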

In [34]:
movies['director_name'].index

RangeIndex(start=0, stop=4916, step=1)

In [35]:
movies['director_name'].dtype

dtype('O')

In [36]:
movies['director_name'].size

4916

In [37]:
movies['director_name'].name

'director_name'

In [38]:
type(movies['director_name'])

pandas.core.series.Series

In [39]:
movies['director_name'].apply(type).unique()

array([<class 'str'>, <class 'float'>], dtype=object)

## Calling Series Methods
<a id='section_id6'></a>

In [40]:
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)

434

In [41]:
df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)

441

In [42]:
len(s_attr_methods & df_attr_methods)

384

### How to do it

In [43]:
director = movies['director_name']
fb_likes = movies['actor_1_facebook_likes']

In [44]:
director.dtype

dtype('O')

In [45]:
fb_likes.dtype

dtype('float64')

In [None]:
director.head()

In [None]:
director.sample(n=5, random_state=42)

In [None]:
fb_likes.head()

In [None]:
director.value_counts()

In [None]:
fb_likes.value_counts()

In [None]:
director.size

In [None]:
director.shape

In [None]:
len(director)

In [None]:
director.unique()

In [None]:
director.count()  # returns the number of non-missing values

In [None]:
fb_likes.count()

In [None]:
fb_likes.quantile()

In [None]:
fb_likes.min()

In [None]:
fb_likes.max()

In [None]:
fb_likes.mean()

In [None]:
fb_likes.median()

In [None]:
fb_likes.std()

In [None]:
fb_likes.describe()

In [None]:
director.describe()

In [None]:
fb_likes.quantile(.2)

In [None]:
fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])

In [None]:
director.isna()

In [None]:
fb_likes_filled = fb_likes.fillna(0)
fb_likes_filled.count()

In [None]:
fb_likes_dropped = fb_likes.dropna()
fb_likes_dropped.size
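
The difference between `.size`, `.count()`, `.fillna()` and `.dropna()` is easier to see on a short Series. This is a minimal sketch with made-up values, not the real `fb_likes` data.

```python
import numpy as np
import pandas as pd

likes = pd.Series([1000.0, np.nan, 40000.0, np.nan, 131.0])

total_rows = likes.size        # counts every element, missing or not
non_missing = likes.count()    # skips NaN
filled = likes.fillna(0)       # same length, NaN replaced by 0
dropped = likes.dropna()       # shorter Series, original index labels kept
```

Note that `dropna()` keeps the original labels (0, 2, 4 here), so the result no longer has a contiguous index.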

### There's more...

In [50]:
director.value_counts(normalize=True)

Steven Spielberg      0.005401
Woody Allen           0.004570
Clint Eastwood        0.004155
Martin Scorsese       0.004155
Spike Lee             0.003324
                        ...   
Allen Hughes          0.000208
Deb Hagan             0.000208
Matthew Robbins       0.000208
J.S. Cardone          0.000208
Christopher Erskin    0.000208
Name: director_name, Length: 2397, dtype: float64

In [None]:
director.hasnans

In [None]:
director.notna()

## Series Operations
<a id='section_id7'></a>

In [51]:
imdb_score = movies['imdb_score']
imdb_score

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [52]:
imdb_score + 1

0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

In [None]:
imdb_score * 2.5

In [None]:
imdb_score // 7

In [53]:
imdb_score > 7

0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

In [54]:
director = movies['director_name']
director == 'James Cameron'

0        True
1       False
2       False
3       False
4       False
        ...  
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

### There's more...

In [None]:
imdb_score.add(1)   # imdb_score + 1

In [None]:
imdb_score.gt(7)   # imdb_score > 7
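
One reason to prefer the method form over the operator is that the arithmetic methods accept extra parameters. As a sketch with made-up values, `Series.add` takes a `fill_value` that stands in for a value missing on one side only.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

with_operator = a + b                  # NaN wherever either side is missing
with_method = a.add(b, fill_value=0)   # a value missing on one side is treated as 0
```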

## Chaining Series Methods
<a id='section_id8'></a>

In [None]:
movies = pd.read_csv('../data/movie.csv')  # same path as the first read above
fb_likes = movies['actor_1_facebook_likes']
director = movies['director_name']

In [None]:
director.value_counts().head(3)

In [None]:
fb_likes.isna().sum()

In [None]:
fb_likes.dtype

In [None]:
(fb_likes.fillna(0)
         .astype(int)
         .head()
)

### There's more...

In [None]:
(fb_likes.fillna(0)
         #.astype(int)
         #.head()
)

In [None]:
(fb_likes.fillna(0)
         .astype(int)
         #.head()
)

In [None]:
fb_likes.isna().mean()

In [None]:
fb_likes.fillna(0) \
        .astype(int) \
        .head()

In [None]:
def debug_df(df):
    print("BEFORE")
    print(df)
    print("AFTER")
    return df

In [None]:
(fb_likes.fillna(0)
         .pipe(debug_df)
         .astype(int) 
         .head()
)

In [None]:
intermediate = None
def get_intermediate(df):
    global intermediate
    intermediate = df
    return df

In [None]:
res = (fb_likes.fillna(0)
         .pipe(get_intermediate)
         .astype(int) 
         .head()
)

In [None]:
intermediate
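
The `global` variable works, but it can be avoided. One possible alternative, sketched below, is a small factory that returns a pass-through function and stores the intermediate result in a dictionary. The names `make_recorder` and `captured` are hypothetical helpers, and the Series holds made-up values.

```python
import numpy as np
import pandas as pd

def make_recorder(store, key):
    """Return a pass-through function for .pipe() that saves its input in `store`."""
    def record(obj):
        store[key] = obj   # keep a reference to the intermediate object
        return obj         # hand it back unchanged so the chain continues
    return record

captured = {}
likes = pd.Series([1000.0, np.nan, 131.0])

res = (likes.fillna(0)
            .pipe(make_recorder(captured, 'after_fillna'))
            .astype(int)
            .head(2)
)
```

After the chain runs, `captured['after_fillna']` holds the Series as it was right after `fillna(0)`, before the cast to `int`.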

## Renaming Column Names
<a id='section_id9'></a>

In [None]:
movies = pd.read_csv('../data/movie.csv')  # same path as the first read above

In [None]:
col_map = {'director_name':'Director Name', 
             'num_critic_for_reviews': 'Critical Reviews'} 

In [None]:
movies.rename(columns=col_map).head()

In [None]:
idx_map = {'Avatar':'Ratava', 'Spectre': 'Ertceps',
  "Pirates of the Caribbean: At World's End": 'POC'}
col_map = {'aspect_ratio': 'aspect',
  "movie_facebook_likes": 'fblikes'}
(movies
   .set_index('movie_title')
   .rename(index=idx_map, columns=col_map)
   .head(3)
)

In [None]:
movies = pd.read_csv('../data/movie.csv', index_col='movie_title')  # same path as the first read above
ids = movies.index.tolist()
columns = movies.columns.tolist()

Rename the row and column labels with list assignments:

In [None]:
ids[0] = 'Ratava'
ids[1] = 'POC'
ids[2] = 'Ertceps'
columns[1] = 'director'
columns[-2] = 'aspect'
columns[-1] = 'fblikes'
movies.index = ids
movies.columns = columns

In [None]:
movies.head(3)

In [None]:
def to_clean(val):
    return val.strip().lower().replace(' ', '_')

In [None]:
movies.rename(columns=to_clean).head(3)

In [None]:
cols = [col.strip().lower().replace(' ', '_')
        for col in movies.columns]
movies.columns = cols
movies.head(3)
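
Two details of `.rename` are worth checking on a small example: it returns a new DataFrame instead of modifying the original, and a dictionary key that matches no column is ignored by default. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame with untidy column names.
toy = pd.DataFrame({' Director Name': ['x'], 'Aspect Ratio ': [1.78]})

def to_clean(val):
    return val.strip().lower().replace(' ', '_')

renamed = toy.rename(columns=to_clean)                  # function applied to every label
ignored = toy.rename(columns={'not_a_column': 'new'})   # unknown key: nothing changes
original_columns = list(toy.columns)                    # toy itself is untouched
```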

## Creating and Deleting columns
<a id='section_id10'></a>

In [None]:
movies = pd.read_csv('../data/movie.csv')  # same path as the first read above
movies['has_seen'] = 0

In [None]:
idx_map = {'Avatar':'Ratava', 'Spectre': 'Ertceps',
  "Pirates of the Caribbean: At World's End": 'POC'}
col_map = {'aspect_ratio': 'aspect',
  "movie_facebook_likes": 'fblikes'}
(movies
   .rename(index=idx_map, columns=col_map)
   .assign(has_seen=0)
)

In [None]:
total = (movies['actor_1_facebook_likes'] +
         movies['actor_2_facebook_likes'] + 
         movies['actor_3_facebook_likes'] + 
         movies['director_facebook_likes'])

In [None]:
total.head(5)

In [None]:
cols = ['actor_1_facebook_likes','actor_2_facebook_likes',
    'actor_3_facebook_likes','director_facebook_likes']
sum_col = movies[cols].sum(axis='columns')
sum_col.head(5)

In [None]:
movies.assign(total_likes=sum_col).head(5)

In [None]:
def sum_likes(df):
   return df[[c for c in df.columns 
              if 'like' in c]].sum(axis=1)

In [None]:
movies.assign(total_likes=sum_likes).head(5)

In [None]:
(movies
   .assign(total_likes=sum_col)
   ['total_likes']
   .isna()
   .sum()
)

In [None]:
(movies
   .assign(total_likes=total)
   ['total_likes']
   .isna()
   .sum()
)

In [None]:
(movies
   .assign(total_likes=total.fillna(0))
   ['total_likes']
   .isna()
   .sum()
)
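
The three counts above differ because the `+` operator and the `.sum` method treat missing values differently. A minimal sketch with two made-up columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'actor_1_facebook_likes': [100.0, np.nan],
                    'actor_2_facebook_likes': [50.0, 20.0]})

# The + operator propagates NaN: one missing operand makes the row total NaN.
plus_total = toy['actor_1_facebook_likes'] + toy['actor_2_facebook_likes']

# .sum skips NaN by default, so the second row is simply 20.
sum_total = toy.sum(axis='columns')

# skipna=False makes .sum behave like the + operator.
strict_total = toy.sum(axis='columns', skipna=False)
```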

In [None]:
def cast_like_gt_actor_director(df):
    return df['cast_total_facebook_likes'] >= \
           df['total_likes']

In [None]:
df2 = (movies
   .assign(total_likes=total,
           is_cast_likes_more = cast_like_gt_actor_director)
)

In [None]:
df2['is_cast_likes_more'].all()

In [None]:
df2 = df2.drop(columns='total_likes')

In [None]:
actor_sum = (movies
   [[c for c in movies.columns if 'actor_' in c and '_likes' in c]]
   .sum(axis='columns')
)

In [None]:
actor_sum.head(5)

In [None]:
movies['cast_total_facebook_likes'] >= actor_sum

In [None]:
movies['cast_total_facebook_likes'].ge(actor_sum)

In [None]:
movies['cast_total_facebook_likes'].ge(actor_sum).all()

In [None]:
pct_like = (actor_sum
    .div(movies['cast_total_facebook_likes'])
)

In [None]:
pct_like.describe()

In [None]:
pd.Series(pct_like.values,
    index=movies['movie_title'].values).head()

In [None]:
profit_index = movies.columns.get_loc('gross') + 1
profit_index

In [None]:
movies.insert(loc=profit_index,
              column='profit',
              value=movies['gross'] - movies['budget'])
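
Unlike `.assign`, `.insert` modifies the DataFrame in place and returns `None`. It also refuses to add a column whose name already exists. The sketch below shows both points on a made-up three-column frame.

```python
import pandas as pd

# Hypothetical frame with the two columns needed for a profit calculation.
toy = pd.DataFrame({'gross': [10.0, 5.0],
                    'genres': ['a', 'b'],
                    'budget': [4.0, 7.0]})

# Position right after 'gross'.
profit_loc = toy.columns.get_loc('gross') + 1
result = toy.insert(loc=profit_loc, column='profit',
                    value=toy['gross'] - toy['budget'])

# Inserting the same column name again raises ValueError by default.
try:
    toy.insert(loc=0, column='profit', value=0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Because the change happens in place, re-running an `insert` cell in a notebook fails with this `ValueError` unless the frame is reloaded first.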

In [None]:
del movies['director_name']

### See also