# Creation of pandas objects

In [10]:
import numpy as np
import pandas as pd
# Restrict the number of displayed rows, just for convenience
pd.set_option('display.max_rows', 8)

## Series
A Series is a 1-dimensional object containing an array of values and a corresponding index (labels for the values). The data in a Series is homogeneous, i.e. all values have the same type.  
Creating a Series is very easy:

In [15]:
numbers = pd.Series([1, 2, 3, 4, 10])
numbers

0     1
1     2
2     3
3     4
4    10
dtype: int64

On the left you can see the index (by default, integers from 0 to the number of elements minus 1) and on the right the values. The data type is shown at the bottom, `int64` in this case (`int` stands for integer and 64 is the number of bits used for each value in the array).
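To illustrate the "one type per Series" rule, here is a minimal sketch with made-up values (the variable names are not part of the original notebook). It shows how pandas picks a single dtype when the input mixes types:

```python
import pandas as pd

# Integers only: an integer dtype is used
ints = pd.Series([1, 2, 3])

# Integers mixed with floats: every value is stored as a float
mixed_numbers = pd.Series([1, 2.5, 3])

# Numbers mixed with strings: values are kept as generic Python objects
mixed_objects = pd.Series([1, 'two', 3.0])

print(ints.dtype, mixed_numbers.dtype, mixed_objects.dtype)
```

The Series is still homogeneous in each case: pandas chooses the narrowest type that can hold all the values.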

A slightly more detailed variant specifies the index of the series explicitly:

In [17]:
floats = pd.Series([1.4, 0, 3.5], index=['a', 'c', 'b'])
floats

a    1.4
c    0.0
b    3.5
dtype: float64

Another way to create a series is to pass a dictionary of `index: value` pairs. The keys of the dictionary become the index and its values become the values of the series.

In [20]:
pets = {'wolf': 27, 'bear': 2, 'falcon': 7}
pet_series = pd.Series(pets)
pet_series

bear       2
falcon     7
wolf      27
dtype: int64

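In the output above the items come out sorted by key, not in the order they were written (older pandas versions sort dictionary keys; recent versions keep the dictionary's insertion order). To set the order explicitly, pass a list of labels in the desired order to the `index` argument.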

In [22]:
ord_pet_series = pd.Series(pets, index=['wolf', 'bear', 'falcon'])
ord_pet_series

wolf      27
bear       2
falcon     7
dtype: int64
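The `index` argument does more than reorder: as far as I know, it also selects which keys are kept, and labels missing from the dictionary get NA. A small sketch, where the label 'lynx' is a made-up key added for illustration:

```python
import pandas as pd

pets = {'wolf': 27, 'bear': 2, 'falcon': 7}

# 'lynx' is not a key of the dictionary; 'falcon' is left out of the index
selected = pd.Series(pets, index=['bear', 'lynx', 'wolf'])
print(selected)

# Which labels ended up without a value
print(selected.isna())
```

Because NA is stored as a floating point value, the integers are converted to floats in the result.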

A Series also carries metadata besides the index, for instance the name of the series and the name of its index. Both can be set at construction time (the index name through `pd.Index`) or assigned to the series after creation.

In [29]:
pd.Index(list(pets.keys()), name='species')

Index(['wolf', 'bear', 'falcon'], dtype='object', name='species')

In [30]:
# Create index with name
ind = pd.Index(list(pets.keys()), name='species')
# Create Series with name and named index
pd.Series(pets, name='pets', index=ind)

species
wolf      27
bear       2
falcon     7
Name: pets, dtype: int64

In [31]:
# Specify names after creation by assigning to attributes
pet_series.name = 'pets'
pet_series.index.name = 'species'
pet_series

species
bear       2
falcon     7
wolf      27
Name: pets, dtype: int64
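Assigning to attributes changes the series in place. If you would rather get a new object and leave the original untouched, `rename()` and `rename_axis()` can be used instead. This is an alternative that the cells above do not cover, sketched here on a fresh series:

```python
import pandas as pd

pet_counts = pd.Series({'wolf': 27, 'bear': 2, 'falcon': 7})

# Both methods return a new Series, so they can be chained
named = pet_counts.rename('pets').rename_axis('species')
print(named)

# The original series still has no names
print(pet_counts.name, pet_counts.index.name)
```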

## DataFrame
A DataFrame is a 2-dimensional object that looks like a table with possibly heterogeneous data. It consists of Series (its columns), each holding homogeneous (same-type) data. There are several ways of creating dataframes; let's look at the manual one first.

In [32]:
# Dataframe from dict
animals = pd.DataFrame({'species': ['wolf', 'bear', 'falcon'], 
                        'population': [27, 2, 7], 
                        'mass': [100, 300, 10]})
animals

| | mass | population | species |
|---|---|---|---|
| 0 | 100 | 27 | wolf |
| 1 | 300 | 2 | bear |
| 2 | 10 | 7 | falcon |


In the fragment above we passed the constructor a dictionary with our data, where each key is the name of a dataframe column and maps to a list of values. So each key-value pair of the dictionary became a column of the dataframe.  
As with series, a default index was generated.

Many more Python objects, including lists, numpy arrays and pandas series, can be fed to the constructor to produce a dataframe, but we'll move on to creating dataframes from files.
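For reference, a few of those other inputs could look like the sketch below. The data is made up for illustration and only the plain `pd.DataFrame` constructor is used:

```python
import numpy as np
import pandas as pd

# From a list of dicts: each dict is one row
from_records = pd.DataFrame([
    {'species': 'wolf', 'population': 27},
    {'species': 'bear', 'population': 2},
])

# From a 2-D numpy array: column names are passed separately
from_array = pd.DataFrame(np.array([[27, 100], [2, 300]]),
                          columns=['population', 'mass'])

# From a dict of Series: rows are matched by index label, not by position
from_series = pd.DataFrame({
    'population': pd.Series([27, 2], index=['wolf', 'bear']),
    'mass': pd.Series([300, 100], index=['bear', 'wolf']),
})
print(from_series)
```

Note that in the last case the two Series list their labels in different order, yet each value lands in the row with the matching label.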

## Load data
One of the most frequently used ways to create a DataFrame is reading data from a `csv` file, though pandas offers a wide range of functions for other data formats. For now let's go through csv.


The path to the input file is the main argument of the pandas functions that read data. In our case we will use the `read_csv()` function and the file 'movie.csv' from the data directory. The path is the only required argument of this function; the many others have default values.  
You can specify them if you need something different: '\t' or ';' as the delimiter instead of ',', loading just a subset of columns, choosing a column to use as the index, etc.

In [13]:
# Read movie.csv and assign dataframe with data to variable movie
movie = pd.read_csv('data/movie.csv')
# Show it
movie

| | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4912 | Color | NaN | 43.0 | 43.0 | NaN | 319.0 | Valorie Curry | 841.0 | NaN | Crime\|Drama\|Mystery\|Thriller | ... | 359.0 | English | USA | TV-14 | NaN | NaN | 593.0 | 7.5 | 16.00 | 32000 |
| 4913 | Color | Benjamin Roberds | 13.0 | 76.0 | 0.0 | 0.0 | Maxwell Moody | 0.0 | NaN | Drama\|Horror\|Thriller | ... | 3.0 | English | USA | NaN | 1400.0 | 2013.0 | 0.0 | 6.3 | NaN | 16 |
| 4914 | Color | Daniel Hsia | 14.0 | 100.0 | 0.0 | 489.0 | Daniel Henney | 946.0 | 10443.0 | Comedy\|Drama\|Romance | ... | 9.0 | English | USA | PG-13 | NaN | 2012.0 | 719.0 | 6.3 | 2.35 | 660 |
| 4915 | Color | Jon Gunn | 43.0 | 90.0 | 16.0 | 16.0 | Brian Herzlinger | 86.0 | 85222.0 | Documentary | ... | 84.0 | English | USA | PG | 1100.0 | 2004.0 | 23.0 | 6.6 | 1.85 | 456 |
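The optional arguments mentioned above (delimiter, subset of columns, index column) can be tried without any file on disk. The sketch below reads a small, made-up semicolon-separated text from memory; the column names and rows are invented for the example and are not taken from 'movie.csv':

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, held in memory instead of a file
raw = io.StringIO(
    "title;year;score;country\n"
    "Film A;2009;7.9;USA\n"
    "Film B;2015;6.8;UK\n"
)

films = pd.read_csv(raw,
                    sep=';',                             # delimiter other than ','
                    usecols=['title', 'year', 'score'],  # load a subset of columns
                    index_col='title')                   # use a column as the index
print(films)
```

The same arguments work when the first argument is a path such as 'data/movie.csv'.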


## Writing to csv
The opposite of loading data is writing it to disk. The `to_csv()` method saves a dataframe to a csv file. Pass the desired filename as the first argument; there are also many optional arguments for tuning the csv format.

In [71]:
# Write the movie dataframe to a file named 'filename'
movie.to_csv('filename')
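A couple of the optional arguments can be explored without touching the disk: when no path is given, `to_csv()` returns the csv text as a string. The following is a small sketch on a made-up dataframe, using `index=False` to skip the index column and `sep` to change the delimiter:

```python
import io
import pandas as pd

animals = pd.DataFrame({'species': ['wolf', 'bear'], 'population': [27, 2]})

# With no path, to_csv() returns the csv text instead of writing a file
csv_text = animals.to_csv(index=False, sep=';')
print(csv_text)

# Reading the text back with the same delimiter restores the table
restored = pd.read_csv(io.StringIO(csv_text), sep=';')
print(restored)
```

Without `index=False` the index is written as an extra first column, which then shows up as an unnamed column when the file is read back.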

Let's extract one column from the dataframe for the next section.

In [34]:
# Take one column as a Series
# Note the NaN at position 4912; more about it below
directors = movie['director_name']
directors

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
              ...        
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

## Some series attributes
Series and dataframes share many attributes and methods with the same names, so you can learn most series methods and then carry that knowledge over to dataframes. The cell below counts the public attributes and methods they have in common.

In [39]:
len(set(filter(lambda x: not x.startswith('_'), dir(pd.Series)))
    .intersection(filter(lambda x: not x.startswith('_'), dir(pd.DataFrame))))

192

The following attributes hold the name of the series, the type of its values and its index:

In [8]:
print(directors.name)
print(directors.dtype)
print(directors.index)

director_name
object
RangeIndex(start=0, stop=4916, step=1)


The `values` attribute holds the content of the series as an array:

In [9]:
directors.values

array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
       'Benjamin Roberds', 'Daniel Hsia', 'Jon Gunn'], dtype=object)

Size of Series can be found in several ways

In [71]:
# Number of elements in series
print(directors.size)
# Shape of array with values
print(directors.shape)
# Number of non NA values
print(directors.count())

4916
(4916,)
4814


As you can see, the last number differs from the previous two. The `count()` method returns the number of non-NA values, i.e. values that are actually present in the series. NA and its synonyms NaN and null mark cells that hold no data. We will call missing values NA because it is the most widely used of the three terms.

The `hasnans` attribute tells whether a Series has any missing (NA) values:

In [88]:
directors.hasnans

True
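The relation between `size`, `count()` and the number of missing values can be checked on a tiny series. This is a minimal sketch with made-up data; `isna()` is not used elsewhere in this section and is included here as one way to count NA values directly:

```python
import numpy as np
import pandas as pd

# Both np.nan and None are treated as NA
names = pd.Series(['James Cameron', np.nan, 'Sam Mendes', None])

n_total = names.size            # all elements, missing ones included
n_present = names.count()       # non-NA elements only
n_missing = names.isna().sum()  # NA elements only

print(n_total, n_present, n_missing)
print(names.hasnans)
```

The total always equals the present values plus the missing ones.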

## Some series methods
Here are several frequently used methods for looking at your data.

In [96]:
# Start of Series
directors.head()

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [97]:
# End of Series
directors.tail(7)

4909     Anthony Vallone
4910        Edward Burns
4911         Scott Smith
4912                 NaN
4913    Benjamin Roberds
4914         Daniel Hsia
4915            Jon Gunn
Name: director_name, dtype: object

Both `head()` and `tail()` take one optional argument, `n`, which sets the number of rows shown (5 by default).
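A short sketch of the `n` argument on a made-up series of letters, small enough to verify by eye:

```python
import pandas as pd

letters = pd.Series(list('abcdefg'))

first_default = letters.head()  # n is 5 by default
first_two = letters.head(2)     # first 2 elements
last_three = letters.tail(3)    # last 3 elements, original index labels kept

print(first_two)
print(last_three)
```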

In [44]:
# Unique values in Series
directors.unique()

array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
       'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)

In [46]:
# Number of unique elements
directors.nunique()

2397

In [6]:
# Number of occurrences of each element in the series
directors.value_counts()

Steven Spielberg    26
Woody Allen         22
Martin Scorsese     20
Clint Eastwood      20
                    ..
Ian Iqbal Rashid     1
S.R. Bindler         1
Mike Gabriel         1
Michael McGowan      1
Name: director_name, Length: 2397, dtype: int64

In [41]:
# Relative frequency of each element, NA included
# value_counts() also has a sort argument and a bins argument for binning data
directors.value_counts(normalize=True, dropna=False)

NaN                 0.020749
Steven Spielberg    0.005289
Woody Allen         0.004475
Clint Eastwood      0.004068
                      ...   
David Duchovny      0.000203
John Gatins         0.000203
Enrique Begne       0.000203
Julie Davis         0.000203
Name: director_name, Length: 2398, dtype: float64
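The arguments of `value_counts()` are easier to follow on a handful of values. The sketch below uses made-up data; `bins` applies to numeric data only, so a second, numeric series is used for it:

```python
import numpy as np
import pandas as pd

sample = pd.Series(['Nolan', 'Allen', 'Nolan', np.nan, 'Nolan', 'Allen'])

# Plain counts: NA is ignored by default
counts = sample.value_counts()

# Shares of the total, with NA counted as its own category
shares = sample.value_counts(normalize=True, dropna=False)

# Numeric data can be grouped into equal-width bins;
# sort=False keeps the bins in ascending order instead of sorting by count
durations = pd.Series([43, 76, 90, 100, 148, 178])
binned = durations.value_counts(bins=3, sort=False)

print(counts)
print(shares)
print(binned)
```

With `normalize=True` the results sum to 1, which makes series of different lengths comparable.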

## Some conversion methods
Here we will examine a few methods for converting a series object to another type.

In [6]:
# Conversion of values to list, index is dropped
directors.tolist()[:5]

['James Cameron',
 'Gore Verbinski',
 'Sam Mendes',
 'Christopher Nolan',
 'Doug Walker']

In [7]:
# Conversion to dictionary index: value
a = directors.to_dict()

# First items of dict
for i, (key, value) in enumerate(a.items()):
    print(key, value)
    if i >= 4:
        break

0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker


In [7]:
# Conversion of Series to DataFrame
directors.to_frame()

| | director_name |
|---|---|
| 0 | James Cameron |
| 1 | Gore Verbinski |
| 2 | Sam Mendes |
| 3 | Christopher Nolan |
| ... | ... |
| 4912 | NaN |
| 4913 | Benjamin Roberds |
| 4914 | Daniel Hsia |
| 4915 | Jon Gunn |
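To see what each conversion keeps and what it drops, the three methods can be compared on a small labelled series. This is a sketch with made-up labels and values, not data from the movie file:

```python
import pandas as pd

scores = pd.Series([7.9, 7.1, 6.8],
                   index=['film_a', 'film_b', 'film_c'],
                   name='imdb_score')

as_list = scores.tolist()     # values only, index and name are dropped
as_dict = scores.to_dict()    # index -> value pairs, name is dropped
as_frame = scores.to_frame()  # one-column dataframe named after the series

# A dict keeps enough information to rebuild the index and the values
restored = pd.Series(as_dict, name='imdb_score')

print(as_list)
print(as_dict)
print(as_frame.columns.tolist())
print(restored.equals(scores))
```

A list can only restore the values, so the index would have to be supplied separately.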
