# Reshaping data using pandas
- **Shape of data**
    - The way in which a dataset is organized in rows and columns
- **Wide format**
    - Each feature is in a separate column
    - Each row contains many features of the same player
    - No repetition but large number of missing values
    - Simple statistics and imputation
- **Long format**
    - Each row represents one feature
    - Multiple rows for each player
    - A column (name) to identify the same player
    - Tidy data:
        - Better to summarize data
        - Key-value pairs
        - Preferred for analysis and graphing
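
To make the two shapes concrete, here is a minimal sketch that holds the same information in both formats. The player names and numbers are made up for illustration only.

```python
import pandas as pd

# Hypothetical data in wide format: one row per player, one column per feature.
wide = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "age": [32, 34],
    "height": [170, 187],
})

# The same data in long format: one row per (player, feature) pair.
long = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "feature": ["age", "height", "age", "height"],
    "value": [32, 170, 34, 187],
})

# Long (tidy) data is easy to summarize with a single groupby.
means = long.groupby("feature")["value"].mean()
print(means)
```
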
#### Reshaping data
- Transforming a DataFrame or Series structure to adjust it for analysis
    - Transposing a DataFrame
        - fifa_players.set_index('club')[['name', 'nationality']].transpose()
- Converting data from wide to long format and vice versa
- Unit of analysis:
    - Long format -> characteristic of a player
    - Wide format -> each player
#### Wide to long transformation
- Performed using `pandas` functions and methods, such as:
    - `.melt()`
    - `pd.wide_to_long()`
#### Long to wide format
- Transform data using `pandas` methods, for example:
    - `.pivot()`
    - `.pivot_table()`
#### Reshaping using pivot method
- From long to wide
    - Demonstrate the relationship between two columns
    - Time series operations with the variables
    - Operations that require each column to be a unique variable
#### Pivoting a dataset
- fifa.pivot(index='name', columns='variable', values='metric_system')
- fifa.pivot(index='name', columns='variable', values=['metric_system', 'imperial_system'])
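
A minimal sketch of the two calls above. The column names come from the notes, while the rows of `fifa` are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (player, variable) pair.
fifa = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B"],
    "variable": ["height", "weight", "height", "weight"],
    "metric_system": [170.0, 72.0, 187.0, 83.0],
    "imperial_system": [5.58, 158.7, 6.14, 183.0],
})

# One value column: the result has a single-level column index.
single = fifa.pivot(index="name", columns="variable", values="metric_system")

# Several value columns: the result has a two-level column index.
both = fifa.pivot(index="name", columns="variable",
                  values=["metric_system", "imperial_system"])
print(single)
print(both)
```
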

#### Pivot method limitations
- General purpose pivoting
- Index/column pair must be unique
- Cannot aggregate values
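
The uniqueness requirement can be seen in a small sketch. The data is hypothetical and deliberately contains a repeated index/column pair.

```python
import pandas as pd

# Hypothetical data where the pair ("Player A", "height") appears twice.
dup = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B"],
    "variable": ["height", "height", "height"],
    "metric_system": [170.0, 171.0, 187.0],
})

try:
    dup.pivot(index="name", columns="variable", values="metric_system")
    error_message = None
except ValueError as exc:
    # pivot cannot decide which of the two values to keep, so it raises.
    error_message = str(exc)

print(error_message)
```
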

#### Pivot table
- A DataFrame containing statistics that summarize the data of a larger DataFrame
- The `index` argument takes the name of the column to use as the index of the pivoted DataFrame, and the `columns` argument takes the name of the column whose values become the new column labels, just as with the pivot method.
- The `values` argument takes the name of the column whose values we want to aggregate. Here we can also specify an aggregation function with `aggfunc`. If we omit it, pandas takes the mean by default, but it is good practice to state the function explicitly.
- df.pivot_table(index="Year", columns="Name", values="Weight", aggfunc="mean")
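
A small sketch of the call above on made-up data, where one index/column pair has two rows that need to be aggregated.

```python
import pandas as pd

# Hypothetical measurements; ("2019", "Ana") appears twice on purpose.
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020],
    "Name": ["Ana", "Ana", "Ben", "Ana"],
    "Weight": [60.0, 62.0, 80.0, 61.0],
})

# Duplicated pairs are averaged; pairs with no data become NaN.
table = df.pivot_table(index="Year", columns="Name",
                       values="Weight", aggfunc="mean")
print(table)
```
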

#### Hierarchical indexes
- Another advantage of pivot tables is that we can have multi-level indexes. 
- fifa_players.pivot_table(index=["first", "last"], columns="movement", values=["overall", "attacking"], aggfunc="max")
- fifa_players.pivot_table(index=["first", "last"], columns="movement", values=["overall", "attacking"], aggfunc="count", margins=True)
    - When the margins parameter is set to True, pandas adds a row and a column labeled `All` that hold the aggregate (here, the count) over all rows and all columns.
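
A sketch of a multi-level pivot table with margins, assuming a tiny made-up `fifa_players` frame with a single value column.

```python
import pandas as pd

# Hypothetical data: two rows per player, one per movement type.
fifa_players = pd.DataFrame({
    "first": ["Ana", "Ana", "Ben", "Ben"],
    "last": ["Lopez", "Lopez", "Smith", "Smith"],
    "movement": ["shooting", "passing", "shooting", "passing"],
    "overall": [92, 92, 93, 82],
})

# Two index columns give a two-level row index; margins adds the "All" totals.
counts = fifa_players.pivot_table(index=["first", "last"], columns="movement",
                                  values="overall", aggfunc="count",
                                  margins=True)
print(counts)
```
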
    
#### Pivot or pivot table?
- Does the DataFrame have more than one value for each index/column pair?
- Do you need to have a multi-index in your resulting pivoted DataFrame?
- Do you need summary statistics of your large DataFrame?
**Yes!** Use `.pivot_table()`

#### Wide to long transformation
- Most data is stored in wide format, so how do we reshape it? Pandas provides a very flexible function for this: melt.
- Reasons to convert to long format:
    - Perform analytics
    - Plot different variables in the same graph

#### Melt
- When using melt on a DataFrame, the first argument to set is id_vars.
- **Melting Data**
    - This argument takes the names of the columns to use as identifier variables. In our example, the values we want to use as identifiers are the columns "first" and "last". These two columns appear in the long format and will help us match all the records for the same observation.
    - **Values and variables**
        - df.melt(id_vars=["first", "last"], value_vars=["age", "height"], var_name="feature", value_name="amount")
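
A minimal sketch of the melt call above. The rows and the extra `club` column are invented, to show that columns left out of both `id_vars` and `value_vars` are dropped.

```python
import pandas as pd

# Hypothetical wide data with two identifier columns.
df = pd.DataFrame({
    "first": ["Ana", "Ben"],
    "last": ["Lopez", "Smith"],
    "age": [32, 34],
    "height": [170, 187],
    "club": ["Club X", "Club Y"],
})

melted = df.melt(
    id_vars=["first", "last"],     # identifiers repeated on every row
    value_vars=["age", "height"],  # columns to unpivot
    var_name="feature",            # name of the column holding old column names
    value_name="amount",           # name of the column holding the values
)
print(melted)
```
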

#### Specifying values to melt
- books.melt(id_vars='title', value_vars=['language_code', 'num_page'])

#### Naming values and variables
- books.melt(id_vars='title', value_vars=['language_code', 'isbn'], var_name='feature', value_name='code')
    
#### Wide to long transformation
- pd.wide_to_long(df, stubnames=["age", "weight"], i="name", j="year")
    - The j argument tells pandas how we want to name the column that contains the suffix, that is, the end of the wide column names.
    - The i argument takes the column or list of columns we will use as unique identifiers.
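
A sketch of the call above on a made-up frame whose column names are a prefix (`age`, `weight`) followed directly by a year.

```python
import pandas as pd

# Hypothetical wide data: stub name + numeric suffix, no separator.
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "age2019": [30, 41],
    "age2020": [31, 42],
    "weight2019": [60, 80],
    "weight2020": [61, 79],
})

# Each stub becomes a column; the suffix goes into the new "year" index level.
long_df = pd.wide_to_long(df, stubnames=["age", "weight"], i="name", j="year")
print(long_df)
```
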

#### Reshaping data
    - pd.wide_to_long(books, stubnames=['ratings', 'sold'], i='title', j='year')
    - We apply the wide-to-long function, passing in the books DataFrame and telling pandas that our columns have the prefixes ratings and sold, that the new column holding the suffix should be called year, and that the title column should be the unique index. In the output we see our new long DataFrame: title and year are now indexes, while the columns ratings and sold contain the values for each year.
   - **DataFrame with index**
       - It is important to mention that if we have a DataFrame with a named index and we apply the wide-to-long function, the resulting DataFrame will not keep the original index.
       - If we want to keep it, we modify the original DataFrame by resetting the index without dropping it, and then apply the transformation including the new column. As we can see in the output, the title is now part of the long DataFrame.
       - books_with_index.reset_index(drop=False, inplace=True)
       - pd.wide_to_long(books_with_index, stubnames=['ratings', 'sold'], i=['author', 'title'], j='year')
   - **sep argument**
       - This new DataFrame is very similar to the previous one, but the name of the columns contains an underscore between the prefix, ratings or sold, and the suffix, the year.
        - If we apply the transformation as before, we'll get an empty DataFrame. This happens because pandas doesn't recognize the column names: by default, it assumes that the prefix is immediately followed by a numeric suffix.
        - To overcome this, we can use the sep argument. We specify that the separator element is an underscore. Now, pandas understands that the prefix ratings or sold is separated by an underscore from the year, and returns the correct DataFrame.
        - pd.wide_to_long(new_books, stubnames=['ratings', 'sold'], i=['title', 'author'], j='year', sep='_')
        - Finally, if the names of the wide columns do not end in a number and we apply the same transformation as before, we'll get an empty DataFrame since pandas assumes the suffixes are numeric.
        - To solve this, we use the suffix argument. We pass the regular expression `\w+` as a raw string, which tells pandas that the column names end in a word.
        - pd.wide_to_long(new_books, stubnames=['ratings', 'sold'], i=['title', 'author'], j='year', sep='_', suffix=r'\w+')
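
The `sep` and `suffix` arguments can be sketched on two small made-up frames. The frame `books_words` and the level name `edition` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical data: prefix and year separated by an underscore.
new_books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "author": ["X", "Y"],
    "ratings_2019": [4.1, 3.9],
    "ratings_2020": [4.3, 4.0],
    "sold_2019": [100, 150],
    "sold_2020": [120, 130],
})
with_sep = pd.wide_to_long(new_books, stubnames=["ratings", "sold"],
                           i=["title", "author"], j="year", sep="_")

# Hypothetical data: the suffix is a word, not a number.
books_words = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "author": ["X", "Y"],
    "ratings_one": [4.1, 3.9],
    "ratings_two": [4.3, 4.0],
    "sold_one": [100, 150],
    "sold_two": [120, 130],
})
by_word = pd.wide_to_long(books_words, stubnames=["ratings", "sold"],
                          i=["title", "author"], j="edition",
                          sep="_", suffix=r"\w+")
print(with_sep)
print(by_word)
```
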
   - **Columns with strings**
       - Splitting into two columns
           - We can use the split method of the str attribute, passing in the element to split. The method returns a list for each row. Each list contains the two sub-strings obtained from splitting the title by the colon. We could also access only one of the resulting elements.
           - In that case, we use the get method from the str attribute, passing in the index of the element we want. In our example, we get the element of index zero. The get method returns the first split element of each row.
           - We can also set the expand argument of split to True. This will return a new DataFrame with two columns, one for each split element.
           - This allows us to assign the split elements to columns in the original DataFrame. In our example, we first split the column title by the colon, indicating we want to expand it to two columns, and assign the result to two new columns, "main_title" and "subtitle". This is useful because we can now drop the original column title and then transform the DataFrame using the new columns as index, getting a clean long DataFrame with a multi-level index.
           - books['title'].str.split(":")
           - books['title'].str.split(":").str.get(0)
           - books['title'].str.split(":", expand=True)
           - books[['main_title', 'subtitle']] = books['title'].str.split(":", expand=True)
               - books.drop('title', axis=1, inplace=True)
               - pd.wide_to_long(books, stubnames=['ratings', 'sold'], i=['main_title', 'subtitle'], j='year')
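
The splitting steps above can be sketched end to end. The titles and numbers are made up, and each title is assumed to contain exactly one colon so that `expand=True` yields two columns.

```python
import pandas as pd

# Hypothetical books with "main title:subtitle" in a single column.
books = pd.DataFrame({
    "title": ["Saga:Part One", "Dune:Part Two"],
    "ratings2019": [4.5, 4.2],
    "ratings2020": [4.6, 4.3],
    "sold2019": [100, 80],
    "sold2020": [120, 90],
})

parts = books["title"].str.split(":")                  # one list per row
first_part = books["title"].str.split(":").str.get(0)  # first element of each list

# expand=True returns a DataFrame, which we assign to two new columns.
books[["main_title", "subtitle"]] = books["title"].str.split(":", expand=True)
books = books.drop("title", axis=1)

long_books = pd.wide_to_long(books, stubnames=["ratings", "sold"],
                             i=["main_title", "subtitle"], j="year")
print(long_books)
```
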
 
   - **Concatenate two columns**
       - This is helpful because then, we can melt our DataFrame using this new column as index instead of using the two original columns.
       - The cat and split methods can also be used for indexes. The following DataFrame has an index named main_title.
       - To concatenate the index with a column in the DataFrame, we access the cat method from the str attribute from the index. We assign it to the index, getting the new concatenated string.
           - books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')
           - books_new['author'] = books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')
           - books_new.melt(id_vars='author', value_vars=['nationality', 'number_books'], var_name='feature', value_name='value')
           - comics_marvel.index = comics_marvel.index.str.cat(comics_marvel['subtitle'], sep='-')
           - Split index
               - comics_marvel.index = comics_marvel.index.str.split('-', expand=True)
           - Concatenate Series
               - books_new['name_author'].str.cat(new_list, sep=' ')
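
A sketch of concatenating columns and indexes with `str.cat`, then splitting the index again. All rows are invented, and the `subtitle` column name is an assumption about the comics data.

```python
import pandas as pd

# Hypothetical author data with first and last name in separate columns.
books_new = pd.DataFrame({
    "name_author": ["Virginia", "Harper"],
    "lastname_author": ["Woolf", "Lee"],
    "nationality": ["British", "American"],
    "number_books": [9, 2],
})
books_new["author"] = books_new["name_author"].str.cat(
    books_new["lastname_author"], sep=" ")
melted = books_new.melt(id_vars="author",
                        value_vars=["nationality", "number_books"],
                        var_name="feature", value_name="value")

# Hypothetical comics data with an index named main_title.
comics_marvel = pd.DataFrame(
    {"subtitle": ["Extremis", "Ragnarok"], "year": [2005, 2004]},
    index=pd.Index(["Iron Man", "Thor"], name="main_title"),
)
# Concatenate the index with a column, then assign it back to the index.
comics_marvel.index = comics_marvel.index.str.cat(comics_marvel["subtitle"], sep="-")
joined = list(comics_marvel.index)

# Splitting an index with expand=True yields a MultiIndex.
comics_marvel.index = comics_marvel.index.str.split("-", expand=True)
print(melted)
print(comics_marvel)
```
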
               
   - **Stacking DataFrames**
       - Setting the index
           - churn.set_index(['country', 'age'], inplace=True)
       - MultiIndex from array
           - Another option is to use the method from_arrays() from MultiIndex. In this case, we define a list of lists named new_array. Each element represents one index. We call the from_arrays() method passing new_array and a list of names we want for the indexes. We assign it to the original DataFrame index by calling the index attribute. As a result, we get a DataFrame with two indices on the rows: "member" and "credit_card".
           - new_array = [['yes', 'no', 'yes'], ['no', 'yes', 'yes']]
           - churn.index = pd.MultiIndex.from_arrays(new_array, names=['member', 'credit_card'])
        - MultiIndex DataFrame
            - The process is very similar. We create two MultiIndexes using the method from_arrays(): one for the index and one for the columns. When we create the DataFrame, we set the index and the columns to be the recently created multi-level indexes. As a result, we get a DataFrame with multi-level indexes on the rows and on the columns.
         - The .stack() method
             - The stack() method reshapes a DataFrame with a multi-level index into a stacked form. Stacking means rearranging the innermost column index to become the innermost row index.
             - Let's take our DataFrame that had a multi-level index on the rows. We apply the stack() method. We have a simple column index. So stack will compress the last level in the DataFrame columns to produce a Series.
             - This DataFrame has a multi-level index in the columns. We'll apply the stack() method. As a consequence, stack() will compress the last level in the columns to produce a DataFrame.
             - It is also possible to choose which level to stack. We want to stack the first column level, so we set the level argument to zero. Now, the stacked level becomes the new lowest level in the row multi-level index. It's important to remember that if we don't set the level argument, stack() will move the last level by default.
             - Our DataFrame has named column levels, so we can specify the level to stack by passing in the column name. In the code, we set level to year. In the resulting DataFrame, we see that the year level has now become the innermost row level.
             - patients.stack()
             - patients.stack(level=0)
             - patients.stack(level='year')
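
A sketch that builds a DataFrame with multi-level indexes on rows and columns using `from_arrays()` and then stacks it. The `patients` values and the level names other than `year` are assumptions made for illustration. Depending on the pandas version, `stack()` may emit a FutureWarning about its new implementation; the results shown here are the same either way.

```python
import pandas as pd

# Hypothetical row and column MultiIndexes built from arrays.
rows = pd.MultiIndex.from_arrays(
    [["Ana", "Ana", "Ben"], ["Lopez", "Diaz", "Smith"]],
    names=["first", "last"])
cols = pd.MultiIndex.from_arrays(
    [[2019, 2019, 2020, 2020], ["height", "weight", "height", "weight"]],
    names=["year", "feature"])
patients = pd.DataFrame(
    [[170, 60, 170, 61], [165, 55, 165, 56], [180, 80, 181, 79]],
    index=rows, columns=cols)

# Default: the last column level ("feature") moves to the rows.
stacked_feature = patients.stack()

# Choose the level by name (equivalent to level=0 here).
stacked_year = patients.stack(level="year")

# With a single-level column index, stacking produces a Series.
as_series = patients[2019].stack()
print(stacked_feature)
print(stacked_year)
```
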
             
   - **Unstacking DataFrames**
       - Pandas provides us with the unstack() method. The unstacking process performs exactly the inverse operation of stacking. So, unstacking means rearranging the innermost row index to become the innermost column index. If we apply the unstack() method, we can see that the innermost row level has now moved to the innermost column level. 
       - We apply the unstack() method. As a result, we can see that the last row level, the feature level has moved to the column level.
       - We can also choose which level to unstack. To that aim, we set the level argument to the index number or the index name as we did with the stack() method.
       - The stack() and unstack() methods implicitly sort the index levels.
       - To change that order, we can use the sort_index() method.
       - One useful way to rearrange levels is to chain the stacking and unstacking processes. Let's unstack the second row level and then, stack the first column level. In the output, we see that the row level named first appears now in the column index. Also, the column level named year has moved to the row index.
       - patients_stacked.unstack() 
       - patients_stacked.unstack(level=1)
       - patients_stacked.unstack(level='first')
       - patients_stacked.unstack().sort_index(ascending=False)
       - patients_stacked.unstack(level=1).stack(level=0) -> Rearranging levels
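
The unstacking calls above can be sketched on a small made-up frame. The level names `first`, `last`, `feature` and `year` are assumptions about what `patients_stacked` looks like.

```python
import pandas as pd

# Hypothetical stacked data: three row levels, one column level ("year").
rows = pd.MultiIndex.from_product(
    [["Ana", "Ben"], ["Diaz", "Lopez"], ["height", "weight"]],
    names=["first", "last", "feature"])
patients_stacked = pd.DataFrame(
    {2019: list(range(8)), 2020: list(range(8, 16))}, index=rows)
patients_stacked.columns.name = "year"

# Default: the innermost row level ("feature") moves to the columns.
unstacked = patients_stacked.unstack()

# Choose the level by name, then sort the row index in descending order.
by_first = patients_stacked.unstack(level="first")
descending = patients_stacked.unstack().sort_index(ascending=False)

# Chain both operations to rearrange levels:
# "last" goes to the columns, then "year" goes to the rows.
rearranged = patients_stacked.unstack(level=1).stack(level=0)
print(rearranged)
```
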

   - **Working with multiple levels**
       - Swap levels and unstack
           - The swaplevel() method can switch the order of two levels within the same axis. This means that we can swap the order of two row levels or two column levels.
           - We apply the swaplevel() method, passing the indices zero and two. In the output, we can see how the first and third row levels are now interchanged.
               - df.swaplevel(0, 2)
           - We can now chain it with the unstacking process. We can see that the row level containing the price and sold features was moved to the column index. If we hadn't changed the order of the levels, the unstacked level would have been the brand level.
           - We first unstack the last row index level, then swap the first and second column levels. We do this by setting the axis parameter to 1. We can see how the year appears on top of the brand level.
           - We can also stack the column index of cars, then call the swaplevel() method, passing zero and two as arguments. In the output, we can see how the recently stacked level and the original first level are switched.
           - Unstacking several levels at the same time is easy. We just have to pass a list of the index numbers to the level parameter. In the output, we see that the first and second row levels are now on the column index. The resulting DataFrame has three levels on the row indices.
           - We use the same syntax to stack several levels. We could pass a list of index numbers or their respective names. In both cases, we get a resulting DataFrame where the year and brand levels are now in the row indices. It's important to notice that the order in which you pass the names matters. In the example, we pass year and then brand, so the innermost level will be the brand level.
           - cars.swaplevel(0,2).unstack()
           - cars.unstack().swaplevel(0,1, axis=1)
           - cars.stack().swaplevel(0,2)
           - cars.unstack(level=[0, 1]) -> Unstacking levels by number
           - cars.unstack(level=['brand', 'model']) -> Unstacking levels by name
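
A sketch of swapping and unstacking levels. The layout of `cars` (row levels `feature`, `model`, `brand` and a `year` column level) is an assumption chosen to match the description above, and the values are placeholders.

```python
import pandas as pd

# Hypothetical cars data with three row levels and a "year" column level.
rows = pd.MultiIndex.from_product(
    [["price", "sold"], ["hatchback", "sedan"], ["Ford", "Tesla"]],
    names=["feature", "model", "brand"])
cars = pd.DataFrame(
    {2019: list(range(8)), 2020: list(range(8, 16))}, index=rows)
cars.columns.name = "year"

# Swap the first and third row levels, then unstack the (new) last level.
swapped = cars.swaplevel(0, 2)
by_feature = swapped.unstack()

# Unstack first, then swap the two column levels (axis=1).
swapped_columns = cars.unstack().swaplevel(0, 1, axis=1)

# Unstack several levels at once by name.
wide = cars.unstack(level=["brand", "model"])
print(by_feature)
print(wide)
```
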

   - **Handling missing data**
       - Unstacking leads to missing values
           - Subgroups do not have the same set of labels
               - animals.unstack(level='class')
               - animals.unstack(level='class', fill_value='No').sort_index(level=['order', 'name'], ascending=[True, False])
               - Stack and missing values
                   - Combinations of index and column values missing from the original DataFrame
                       - This happens because stack() has the argument dropna set to True by default. This drops all rows that have only missing values.
                           - flowers.stack(dropna=True)
                       - If for some reason we want to keep that information, we need to set the dropna argument to False. We can see in the resulting DataFrame that the row with indices rose size is now present. All its values are missing values.
                           - flowers.stack(dropna=False)
                           
                       - We could then fill the missing values using the method fillna(). We pass the value with which we want to replace the missing values. And the resulting DataFrame will have zeros instead of NaNs.
                           - flowers.stack(dropna=False).fillna(0)
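
A sketch of how unstacking introduces missing values and two ways to fill them. The `animals` rows are invented and use a numeric column, so the fill value is `0` rather than `'No'`. The `dropna` argument of `stack()` is left out on purpose because its handling differs between pandas versions.

```python
import pandas as pd

# Hypothetical data: the subgroups do not share the same labels.
index = pd.MultiIndex.from_tuples(
    [("mammal", "carnivora", "dog"),
     ("mammal", "carnivora", "cat"),
     ("bird", "accipitriformes", "eagle")],
    names=["class", "order", "name"])
animals = pd.DataFrame({"count": [3, 2, 1]}, index=index)

# Missing (row, class) combinations become NaN.
with_nan = animals.unstack(level="class")

# Fill them while unstacking, then sort by two levels in different directions.
filled = animals.unstack(level="class", fill_value=0).sort_index(
    level=["order", "name"], ascending=[True, False])

# Or fill after the fact with fillna().
filled_later = with_nan.fillna(0)
print(with_nan)
print(filled)
```
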
                           
   - **Reshaping and combining data**
       - Statistical functions
           - Sum: `.sum()`
           - Mean: `.mean()`
           - Median: `.median()`
           - Difference: `.diff()`
        - Stacking and stats
            - Total amount of online and on-site sales by year in the two countries
                - To obtain the total amount of online and onsite sales by year in the two countries, we chain the stack and the sum functions and apply it to the sales DataFrame as you see in the code. We set axis to 1 to apply it over the column axis. The stack method returns a DataFrame with a new inner-most index level, the shop label. The sum method gives the total amount of products sold.
                - sales.stack().sum(axis=1)
             - Total amount of on-site sales by year in the two countries
                 - sales.stack().sum(axis=1).unstack()
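
A sketch of chaining `stack()`, `sum()` and `unstack()`. The layout of `sales` (a `year` row index and column levels `country` and `shop`) is an assumption consistent with the description above, and the numbers are invented.

```python
import pandas as pd

# Hypothetical sales: rows are years, columns are (country, shop) pairs.
cols = pd.MultiIndex.from_product(
    [["France", "Spain"], ["online", "onsite"]], names=["country", "shop"])
sales = pd.DataFrame(
    [[10, 20, 30, 40], [50, 60, 70, 80]],
    index=pd.Index([2019, 2020], name="year"), columns=cols)

# stack() moves "shop" to the rows; sum(axis=1) then adds up the countries.
totals = sales.stack().sum(axis=1)

# unstack() moves "shop" back to the columns for a compact table.
totals_table = totals.unstack()
print(totals)
print(totals_table)
```
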