# Corporate Messaging Case Study
In this lesson, we'll learn about automating our machine learning workflows with pipelines using this dataset on corporate messages as a case study. This is one of the free datasets provided on the [Figure Eight Platform](https://www.figure-eight.com/data-for-everyone/).

![](figure8_corporate_messaging.png)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
df.head()

| Unnamed: 0 | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | category | category:confidence | category_gold | id | screenname | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 662822308 | False | finalized | 3 | 2/18/15 4:31 | Information | 1.0 | | 4.36528e+17 | Barclays | Barclays CEO stresses the importance of regula... |
| 1 | 662822309 | False | finalized | 3 | 2/18/15 13:55 | Information | 1.0 | | 3.86013e+17 | Barclays | Barclays announces result of Rights Issue http... |
| 2 | 662822310 | False | finalized | 3 | 2/18/15 8:43 | Information | 1.0 | | 3.7958e+17 | Barclays | Barclays publishes its prospectus for its å£5.... |
| 3 | 662822311 | False | finalized | 3 | 2/18/15 9:13 | Information | 1.0 | | 3.6753e+17 | Barclays | Barclays Group Finance Director Chris Lucas is... |
| 4 | 662822312 | False | finalized | 3 | 2/18/15 6:48 | Information | 1.0 | | 3.60385e+17 | Barclays | Barclays announces that Irene McDermott Brown ... |


Each row describes a social media post from a corporation, and the `category` column records how that post was classified.

If we look at the value counts for this category column,

In [3]:
df['category'].value_counts()

Information    2129
Action          724
Dialogue        226
Exclude          39
Name: category, dtype: int64

We can see that each post is classified as information, dialogue, or action. If we look back at the Figure Eight site, we can find a description of each category:
- `information`: objective statements about the company or its activities
- `dialogue`: such as replies to users
- `action`: such as messages that ask for votes or ask users to click on links
    
Information seems to be the most common category by far, followed by action, and then dialogue. There is also a small miscellaneous "Exclude" category.
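
To put "by far" into numbers, the counts printed above can be converted into shares. This is a minimal sketch that re-enters those four counts by hand in a Series named `counts` (a name introduced here for illustration) rather than reading the CSV. On the real DataFrame, `df['category'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
counts = pd.Series(
    {"Information": 2129, "Action": 724, "Dialogue": 226, "Exclude": 39}
)

# Divide by the total to get each category's share of all posts.
shares = counts / counts.sum()
print(shares.round(3))
```

Information accounts for roughly 68% of all posts.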

Since we are only interested in rows classified into one of the top three categories with full confidence, we are going to narrow `df` down to rows where `category:confidence` is 1 and `category` is not "Exclude".

In [4]:
df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
df.category.value_counts()

Information    1823
Action          456
Dialogue        124
Name: category, dtype: int64
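
The filter in the cell above combines two boolean masks. The sketch below spells out each mask separately on a small made-up DataFrame; `toy`, `is_confident`, and `not_excluded` are illustrative names and the rows are invented, so only the mechanics carry over to the real data.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "category": ["Information", "Action", "Exclude", "Dialogue"],
    "category:confidence": [1.0, 0.67, 1.0, 1.0],
    "text": ["post a", "post b", "post c", "post d"],
})

# Condition 1: keep only rows labeled with full confidence.
is_confident = toy["category:confidence"] == 1
# Condition 2: drop the miscellaneous "Exclude" category.
not_excluded = toy["category"] != "Exclude"

# `&` combines the masks element-wise. When the comparisons are written
# inline, each needs its own parentheses because `&` binds tighter than `==`.
filtered = toy[is_confident & not_excluded]
print(filtered["category"].tolist())
```

Only rows that satisfy both conditions survive, which is why the low-confidence "Action" row and the "Exclude" row are dropped here.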

Awesome, this is what we have left. Now let's isolate the columns to be used as features and labels in this classifier and convert them to NumPy arrays. Here, those are the `text` column and the `category` column.

In [5]:
X = df.text.values
y = df.category.values

Here's the first row of text data.

In [6]:
X[0]

'Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG'

And the category for that first row.

In [7]:
y[0]

'Information'

To repeat the transformations made on this dataset, here are our steps in a nice `load_data` function that we'll use for the rest of the lesson.

In [8]:
def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y

Let's try it out.

In [9]:
X, y = load_data()

In [10]:
X.shape, y.shape

((2403,), (2403,))

Our data is looking good, with 2403 rows of corporate messaging data.
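
If you want to check the loading logic without the full file, one option is to make the path an argument and point the function at a tiny CSV. The variant below is a sketch under that assumption: the `path` parameter, the temporary file name, and the four rows are all hypothetical and not part of the lesson's `load_data`.

```python
import os
import tempfile

import pandas as pd


def load_data(path="corporate_messaging.csv"):
    """Same steps as the lesson's load_data, with the file path as a
    (hypothetical) parameter."""
    df = pd.read_csv(path, encoding="latin-1")
    keep = (df["category:confidence"] == 1) & (df["category"] != "Exclude")
    df = df[keep]
    return df["text"].values, df["category"].values


# Tiny made-up CSV with the three columns the function relies on.
toy_csv = (
    "category,category:confidence,text\n"
    "Information,1.0,Quarterly results published\n"
    "Action,0.66,Vote for us today\n"
    "Exclude,1.0,Out of scope\n"
    "Dialogue,1.0,Thanks for reaching out\n"
)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_messages.csv")
    with open(toy_path, "w", encoding="latin-1") as f:
        f.write(toy_csv)
    X_toy, y_toy = load_data(toy_path)

print(len(X_toy), list(y_toy))
```

Two of the four toy rows pass the filter, mirroring how the real file shrinks to the 2403 rows reported above.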

In [11]:
X[:10]

array([ 'Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG',
       'Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG',
       'Barclays publishes its prospectus for its å£5.8bn Rights Issue: http://t.co/YZk24iE8G6',
       'Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD',
       'Barclays announces that Irene McDermott Brown has been appointed as Group Human Resources Director http://t.co/c3fNGY6NMT',
       'Barclays response to PRA capital shortfall exercise: http://t.co/LwsUQVFaMz',
       'Barclays sponsors #Zamynforum BBC World Service debate on globalisation, part of a series of citizenship lectures - http://t.co/5Mqcj0LIRg',
       'Barclays has today published its response to The Salz Review, the independent report into our business practices: http://t.co/QIrl6TuAtf',
       '59% of workers are ei

In [12]:
y[:10]

array(['Information', 'Information', 'Information', 'Information',
       'Information', 'Information', 'Information', 'Information',
       'Action', 'Action'], dtype=object)

Now that you're more familiar with this data, you'll need to clean it before we begin modeling. Your natural language processing skills will be helpful for this!
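
As one possible starting point for that cleaning step, the sketch below replaces links with a placeholder, lowercases the text, and splits it into word tokens using only the standard library. The function name `clean_text`, the `urlplaceholder` token, and the URL pattern are assumptions made for illustration, not the lesson's prescribed approach.

```python
import re

# Simple pattern for http/https links. This is an assumption that suits the
# t.co links seen in the sample rows; it is not a general URL matcher.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_text(text, placeholder="urlplaceholder"):
    """Replace links with a placeholder, lowercase, and split into tokens."""
    text = URL_PATTERN.sub(placeholder, text)
    # Keep runs of lowercase letters and digits; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


sample = "Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG"
print(clean_text(sample))
```

Replacing each link with a single token keeps the signal that a post contains a link without treating every unique URL as its own word.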