# <span style="color:darkred; font-weight:bold; font-size:48px;"><i>Transformer</i>-Powered Article Summarization</span>

**By: Hiba Fathallah**

**CS495 Final Project**

<!-- Centering the image -->
<div style="text-align:center">
  <!-- Inserting the image with a URL or file path -->
  <img src="https://i.ibb.co/QNmx6Hm/image.png" alt="decod" width="900" height="600">
</div>

# **<span style="color:darkblue">1.  Introduction</span>**

Summarization is the task of producing a shorter version of a document while preserving its most important information. There are two main approaches:
* Extractive Summarization: the model selects sentences or phrases directly from the original input.
* Abstractive Summarization: the model generates new text that paraphrases the input.
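
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring ones. The function name `extractive_summary`, the naive sentence splitter, and the sample text are assumptions for illustration; this is not the transformer model used in this project.

```python
import re
from collections import Counter

def extractive_summary(document: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Pick the top-k sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

doc = (
    "Keep supplies in one area. "
    "Label supplies so supplies are easy to find. "
    "Clean your desk every night."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model would instead be free to produce wording that never appears in the source.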


**<span style="color:darkred"> Use Cases:</span>**
    
* Research Paper Summarization 🧐
* News Summarization
* Legal Document Summarization

# **<span style="color:darkblue">2.  Data Exploration</span>**

In [1]:
import pandas as pd
import numpy as np

# Data Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio

# Statistics & Mathematics
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import shapiro, skew, anderson, kstest, gaussian_kde,spearmanr
import math

In [2]:
file_path = "/kaggle/input/wikihow/wikihowAll.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."


In [3]:
# Configuring pandas to display wider columns
# This lets us read each article and its summary in full
pd.set_option('display.max_colwidth', 1000)

In [4]:
first_row_text = df.loc[0, 'headline']

# Print the entire text of the specified column for the first row
print(first_row_text)

first_row_text = df.loc[0, 'text']

# Print the entire text of the specified column for the first row
print(first_row_text)


Keep related supplies in the same area.,
Make an effort to clean a dedicated workspace after every session.,
Place loose supplies in large, clearly visible containers.,
Use clotheslines and clips to hang sketches, photos, and reference material.,
Use every inch of the room for storage, especially vertical space.,
Use chalkboard paint to make space for drafting ideas right on the walls.,
Purchase a label maker to make your organization strategy semi-permanent.,
Make a habit of throwing out old, excess, or useless stuff each month.
 If you're a photographer, keep all the necessary lens, cords, and batteries in the same quadrant of your home or studio. Paints should be kept with brushes, cleaner, and canvas, print supplies should be by the ink, etc. Make broader groups and areas for your supplies to make finding them easier, limiting your search to a much smaller area. Some ideas include:


Essential supplies area -- the things you use every day.
Inspiration and reference area.
Dedicated

In [5]:
print(f"\n{type(df).__name__} shape: {df.shape}")
print(f'\nMissing Data: \n{df.isnull().sum()}')
print(f'\nDuplicates: {df.duplicated().sum()}')


DataFrame shape: (215365, 3)

Missing Data: 
headline     818
title          1
text        1071
dtype: int64

Duplicates: 0


In [6]:
print(f'\n{type(df).__name__} Head: \n')
display(df.head(5))
print(f'\n{type(df).__name__} Tail: \n')
display(df.tail(5)) 


DataFrame Head: 



Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMake an effort to clean a dedicated workspace after every session.,\nPlace loose supplies in large, clearly visible containers.,\nUse clotheslines and clips to hang sketches, photos, and reference material.,\nUse every inch of the room for storage, especially vertical space.,\nUse chalkboard paint to make space for drafting ideas right on the walls.,\nPurchase a label maker to make your organization strategy semi-permanent.,\nMake a habit of throwing out old, excess, or useless stuff each month.",How to Be an Organized Artist1,"If you're a photographer, keep all the necessary lens, cords, and batteries in the same quadrant of your home or studio. Paints should be kept with brushes, cleaner, and canvas, print supplies should be by the ink, etc. Make broader groups and areas for your supplies to make finding them easier, limiting your search to a much smaller area. Some ideas include:\n\n\nEssential supplies area -- the things you use every day.\nInspiration and reference area.\nDedicated work area .\nInfrequent or secondary supplies area, tucked out of the way.;\n, This doesn't mean cleaning the entire studio, it just means keeping the area immediately around the desk, easel, pottery wheel, etc. clean each night. Discard trash or unnecessary materials and wipe down dirty surfaces. Endeavor to leave the workspace in a way that you can sit down the next day and start working immediately, without having to do any work or tidying.\n\n\nEven if the rest of your studio is a bit disorganized, an organized worksp..."
1,"\nCreate a sketch in the NeoPopRealist manner of the future mural on a small piece of paper 8""x10"" using the black ink pen.,\nPrepare to create your NeoPopRealist mural.,\nPrepare your paint.,\nBegin your project with a design.,\nProduce a scaled down version of your finished mural.,\nPrepare the wall to be painted.,\nAfter you have primed the surface, measure the wall.,\nPaint in the base coat of the background.,\nAllow the background and base coats to dry.,\nDraw the lines, then fill the appeared section with different repetitive patterns (examine the images above).,\nPaint patterns with brushes of suitable size for the particular portion of work you are painting.,\nClean up the lines and shapes as needed.,\nSeal the mural if needed.,\nBe inspired and it will help you succeed!",How to Create a Neopoprealist Art Work,"See the image for how this drawing develops step-by-step. However, there is an important detail: the following drawings are to examine it, and then, to create something unique.\n\n\nUse the lines to create the image shape and sections.\nFill appeared sections with different patterns/ ornaments.\nAdd text if needed, for example ""NeoPopRealism is 25!""\nAdd a colored strip on the top, any color you wish.;\n, Painting a mural always requires some preparation. You‘ll need equipment and effort, but planning and attention to detail will help you succeed. Painting a mural requires a suitable location, with the right surface that can be painted.\n\nThis surface should be smooth and flat. However, even rough-textured surfaces can be used for your NeoPopRealist mural project.\n\n, For exterior projects that last for years, using a newer 100% acrylic exterior paint would be your best choice. For interior walls, use latex paints. Latex offer easier cleanup and lower costs. By measuring the tot..."
2,"\nGet a bachelor’s degree.,\nEnroll in a studio-based program.,\nTrain on a number of VFX computer programs.,\nWatch online tutorials.,\nNurture your artistic side.,\nPay close attention to movies, television shows, and video games.,\nDevelop a specialization.",How to Be a Visual Effects Artist1,"It is possible to become a VFX artist without a college degree, but the path is often easier with one. VFX artists usually major in fine arts, computer graphics, or animation. Choose a college with a reputation for strength in these areas and a reputation for good job placement for graduates. The availability of internships is another factor to consider.Out of the jobs advertised for VFX artists, a majority at any given time specify a bachelor’s degree as a minimum requirement for applicants.;\n, Some studios offer short-term programs for people who want to learn more about VFX artistry without pursuing a college degree. Enrolling in these programs can be expensive as financial aid isn’t always offered, but they usually have the most cutting edge technology for you to learn from., Although you may create some hand sketches, the majority of your work will be completed on the computer using the most up-to-date programs. Stay informed about the newest software advances by following V..."
3,"\nStart with some experience or interest in art.,\nUnderstand the difference between art collectors, art investors and art speculators.,\nFigure out what you are willing to pay for art, before going to an auction house.,\nPay attention to what schools of art are selling well, and which are down.,\nFocus art investments on fine art paintings, rather than decorative art.,\nReach out to trusted auction houses and dealers when you are looking to buy art.,\nBuy your investment art when you feel confident of its worth, its price and its ability to grow in value.,\nStudy how art is properly stored.,\nHave your art investments appraised occasionally.,\nConsider renting out your art investments.,\nUnderstand that selling an art investment can take time.",How to Become an Art Investor,"The best art investors do their research on the pieces of art that they buy, so someone with some education or interest in the art world is more likely to understand this niche market. As well as personal research, you will need to have contacts with people in the art world, such as auctioneers, gallery directors and dealers, who can give you good investment advice.;\n, You may confuse these three terms, if you are not careful. Each of them has a slightly different goal in mind when looking to buy art.\n\n\nArt collectors do not buy art for investment purposes. They buy it to decorate and display in their home. Because they consider them to be an important part of their home or life, most art collectors have a hard time parting with pieces of their collection. While many collectors do end up selling some pieces of art, it may be done because of necessity. Collectors often loan their works out to museums and occasionally donate them to museums upon their death.\nArt investors seek ..."
4,"\nKeep your reference materials, sketches, articles, photos, etc, in one easy to find place.,\nMake ""studies,"" or practice sketches, to organize effectively for larger projects.,\nLimit the supplies you leave out to the project at hand.,\nKeep an updated list of all of the necessary supplies, and the quantities of each.,\nBreak down bigger works into more easily completed parts.",How to Be an Organized Artist2,"As you start planning for a project or work, you'll likely be gathering scraps of inspiration and test sketches. While everyone has a strategy, there is nothing more maddening than digging through a book or the internet to re-find the cool idea you saw three months ago. Try out:\n\n\nDedicating 1 notebook, preferably with insert folders, to each project.\nMaking a bookmark folder for each project on your internet browser to easily compile online inspiration.\nTacking up physical inspiration on a wall or cork board near your workspace., Very few artists simply dive right into large projects. Almost 100% of the time they instead work on related, smaller projects called ""studies"" to prepare for the larger work. You might practice the face of the portrait you're making, sketch our different composition ideas, or practice a vulnerable or difficult part of a sculpture. Keep these organized as a way to prepare both the skills, ideas, and supplies needed for the final project.\n\n, At the..."



DataFrame Tail: 



Unnamed: 0,headline,title,text
215360,"\nConsider changing the spelling of your name.,\nAvoid symbols in your name.,\nAdd an exotic element.,\nConsider how people will pronounce your name.,\nConsider your international profile.,\nBe consistent with spelling and formatting.",How to Pick a Stage Name3,"If you have a name that you like, you might fiddle with the spelling to see if alternate letters will make it more interesting. The band Gotye, pronounced “Go-tee-ay,” is a respelling of the French surname Gaultier.Sometimes this isn’t a good idea, especially if you’re adding an extra letter where it is really unnecessary. You might just risk confusing people and making it difficult to pronounce your name.\n\n, While it may be a hot thing to replace an S in your name with a $ or an I with a !, these just add confusion and likely mistakes in spelling your name. Even though Ke$ha and others have done it, you should skip this.\n\n\nThe singer Prince changed his name to a symbol in order to get out of his contract with Warner Bros. in 1993. Since the symbol was unpronounceable, he was called The Artist Formerly Known as Prince. This really only works if you already have a well-established reputation and fan following, and ultimately would make things too complicated anyway. Prince rev..."
215361,"\nTry out your name.,\nDon’t legally change your name.,\nRegister your stage name with a trade guild or union.,\nUpdate your bank account information.,\nReserve social media accounts with your stage name.,\nReserve a website domain.",How to Pick a Stage Name4,"Your name might sound great to you when you say it out loud in your bedroom. Find out how it sounds when someone else is announcing you. Think of this as market testing your name , Unless you are completely abandoning your real name altogether, there is no need to legally change your name. This will help you maintain a distinction between your personal and professional lives.\n\n, If you are currently a member of a trade guild, such as the Screen Actors Guild or the American Federation of Musicians, you should update your membership information with your stage name. It’s best to ensure that no one else in your guild has the same name as you.If you are not yet part of a union or guild, you may consider joining one sometime in the future. In this case, keep in mind that you should probably register with your real name and stage name in one membership.\n\n, You may want to include your stage name on your bank account. This is especially true if you have a business bank account and yo..."
215362,"\nUnderstand the process of relief printing.,\nExamine the rim of the print.,\nLook for signs of embossing.,\nLook for signs of cutting in the cross-hatching or shaded areas.",How to Identify Prints1,"Relief printing is the oldest and most traditional printing technology, and involves reproducing images at its most basic. In relief printing, a wood or metal relief block is carved by cutting away the areas of the picture that will not be printed, then ink is applied to the raised areas either by dabbing the areas to be printed, or rolling the ink on. The final stage of the process involves transferring the ink to the page by laying a sheet of paper and applying pressure. Examples of relief prints include:\n\n\nWood block printing\nLinocut\nType-set;\n, One of the quickest and most reliable ways of identifying relief prints is to examine the edges of the print for evidence. The process by which ink is transferred from the block via pressure will produce a characteristic rim around the edges of life. This is a feature that is only characterized by relief printing processes, so it's always a sure sign.For comparison purposes, examine the serial number on any bill of US currency. Yo..."
215363,"\nUnderstand the process of intaglio printing.,\nLook for plate marks.,\nLook for raised ink.,\nLook for varying intensity of color in single lines.,\nLook at the shape of the line.,\nStudy more intaglio techniques.",How to Identify Prints2,"Intaglio is Italian for ""incis­ing,"" and correspondingly revolves around a process of applying ink into the grooves or etches or engravings, then using a lot of pressure to transfer that ink from the indents onto the page. This usually results in slightly crisper, more substantial lines that you can learn to identify. The process was developed in the 1500s. Engraving and etching are both styles of intaglio printing, with slightly different techniques and signifiers.Engraving is typically done on copper plates, using a burin, a v-shaped cutting tool, to remove slivers of metal from the surface of the plate. The shape of engraved lines are typically quite clean, and pointed at each end, where the lines will swell or shrink.\n\nEtching is done using acid to draw freely over wax placed on the copper plating, using a needle. Etched lines will have a blunter end than engraved lines, and you should be able to see signs of the wax in unevenness and crumbling at the edge of the lines. In g..."
215364,"\nUnderstand the different varieties of lithography.,\nMagnify the image.,\nLook for the absence of plate marks.,\nLook for the flatness of the ink.,\nLook for the illusion of shade, created by multiple layers.,\nLook for blurriness.",How to Identify Prints3,"Lithography is a big term often used to refer to many different styles of printing, contemporary and classical. But, in pre-photographic terms, planographic lithography is that which is printed from a flat surface. In planographic printing, plates are prepared by laying down an image in a greasy or oily substance, typically called tusche, that will hold ink. The blank areas of the plate will then be washed off with water, removing the ink from those areas. Types of planographic lithography include:\n\n\nChalk-manner prints, which are made by using wax crayon to draw the image onto limestone.\nChromolithography, which are identifiable based on the stippling of multiple colors on the plate.\nTinted lithography is made via two plates, one of which uses broad individual background strokes of tinting to give the image background color.\nTransfer lithography isn't transferred directly from stone to paper, but from transfer paper to the stone itself, meaning that the image needn't be dra..."


In [7]:
df.drop('title', axis=1, inplace=True)
df.dropna(inplace=True)

In [8]:
print(f'\nMissing Data: \n{df.isnull().sum()}')


Missing Data: 
headline    0
text        0
dtype: int64


##### After dropping the `title` column and the rows with missing values, we inspect the word counts of both the summary (`headline`) and the article (`text`).

In [9]:
# Configuring notebook
seed = 42
colormap = 'cividis'
template = 'plotly_dark'

In [10]:
def histogram_boxplot(df,hist_color, box_color, height, width, legend, name):
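    '''
    Plots a box plot and a density curve side by side for every numeric column.

    Parameters:
    df = DataFrame whose numeric columns are plotted
    hist_color = The color of the density curve
    box_color = The color of the box plots
    height and width = Image size
    legend = Whether to display the legend
    name = Dataset name shown in the figure title
    '''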

    features = df.select_dtypes(include = [np.number]).columns.tolist()

    for feat in features:
        try:
            fig = make_subplots(
                rows=1,
                cols=2,
                subplot_titles=["Box Plot", "Density"],
                horizontal_spacing=0.2
            )

            density = gaussian_kde(df[feat])
            x_vals = np.linspace(min(df[feat]), max(df[feat]), 200)
            density_vals = density(x_vals)

            fig.add_trace(go.Scatter(x=x_vals, y = density_vals, mode = 'lines',
                                     fill = 'tozeroy', name="Density", line_color=hist_color), row=1, col=2)
            fig.add_trace(go.Box(y=df[feat], name="Box Plot", boxmean=True, line_color=box_color), row=1, col=1)

            fig.update_layout(title={'text': f'<b>{name} Word Count<br><sup><i>&nbsp;&nbsp;&nbsp;&nbsp;{feat}</i></sup></b>',
                                     'x': .025, 'xanchor': 'left'},
                             margin=dict(t=100),
                             showlegend=legend,
                             template = template,
                             #plot_bgcolor=bg_color,paper_bgcolor=paper_color,
                             height=height, width=width
                            )

            fig.update_yaxes(title_text=f"<b>Words</b>", row=1, col=1, showgrid=False)
            fig.update_xaxes(title_text="", row=1, col=1, showgrid=False)

            fig.update_yaxes(title_text="<b>Density</b>", row=1, col=2, showgrid=False)
            fig.update_xaxes(title_text=f"<b>Words</b>", row=1, col=2, showgrid=False)

            fig.show()
            print('\n')
        except Exception as e:
            print(f"An error occurred: {e}")

In [11]:
columns = ['headline', 'text']
df_text_length = pd.DataFrame() # Creating an empty dataframe

for feat in columns: # Iterating through features --> headline (summary) & text (article)
    text_length = f'{feat}_length'
    df_text_length[text_length] = df[feat].apply(lambda x: len(str(x).split())) # Counting whitespace-separated words for each feature

# Plotting histogram-boxplot
histogram_boxplot(df_text_length,'#89c2e0', '#d500ff', 600, 1000, True, 'Train Dataset')

[Figure: box plot and density curve of word counts for `headline_length` and `text_length`]

In [12]:
df = pd.concat([df, df_text_length], axis=1)

In [13]:
df.head()

Unnamed: 0,headline,text,headline_length,text_length
0,"\nKeep related supplies in the same area.,\nMake an effort to clean a dedicated workspace after every session.,\nPlace loose supplies in large, clearly visible containers.,\nUse clotheslines and clips to hang sketches, photos, and reference material.,\nUse every inch of the room for storage, especially vertical space.,\nUse chalkboard paint to make space for drafting ideas right on the walls.,\nPurchase a label maker to make your organization strategy semi-permanent.,\nMake a habit of throwing out old, excess, or useless stuff each month.","If you're a photographer, keep all the necessary lens, cords, and batteries in the same quadrant of your home or studio. Paints should be kept with brushes, cleaner, and canvas, print supplies should be by the ink, etc. Make broader groups and areas for your supplies to make finding them easier, limiting your search to a much smaller area. Some ideas include:\n\n\nEssential supplies area -- the things you use every day.\nInspiration and reference area.\nDedicated work area .\nInfrequent or secondary supplies area, tucked out of the way.;\n, This doesn't mean cleaning the entire studio, it just means keeping the area immediately around the desk, easel, pottery wheel, etc. clean each night. Discard trash or unnecessary materials and wipe down dirty surfaces. Endeavor to leave the workspace in a way that you can sit down the next day and start working immediately, without having to do any work or tidying.\n\n\nEven if the rest of your studio is a bit disorganized, an organized worksp...",84,608
1,"\nCreate a sketch in the NeoPopRealist manner of the future mural on a small piece of paper 8""x10"" using the black ink pen.,\nPrepare to create your NeoPopRealist mural.,\nPrepare your paint.,\nBegin your project with a design.,\nProduce a scaled down version of your finished mural.,\nPrepare the wall to be painted.,\nAfter you have primed the surface, measure the wall.,\nPaint in the base coat of the background.,\nAllow the background and base coats to dry.,\nDraw the lines, then fill the appeared section with different repetitive patterns (examine the images above).,\nPaint patterns with brushes of suitable size for the particular portion of work you are painting.,\nClean up the lines and shapes as needed.,\nSeal the mural if needed.,\nBe inspired and it will help you succeed!","See the image for how this drawing develops step-by-step. However, there is an important detail: the following drawings are to examine it, and then, to create something unique.\n\n\nUse the lines to create the image shape and sections.\nFill appeared sections with different patterns/ ornaments.\nAdd text if needed, for example ""NeoPopRealism is 25!""\nAdd a colored strip on the top, any color you wish.;\n, Painting a mural always requires some preparation. You‘ll need equipment and effort, but planning and attention to detail will help you succeed. Painting a mural requires a suitable location, with the right surface that can be painted.\n\nThis surface should be smooth and flat. However, even rough-textured surfaces can be used for your NeoPopRealist mural project.\n\n, For exterior projects that last for years, using a newer 100% acrylic exterior paint would be your best choice. For interior walls, use latex paints. Latex offer easier cleanup and lower costs. By measuring the tot...",131,608
2,"\nGet a bachelor’s degree.,\nEnroll in a studio-based program.,\nTrain on a number of VFX computer programs.,\nWatch online tutorials.,\nNurture your artistic side.,\nPay close attention to movies, television shows, and video games.,\nDevelop a specialization.","It is possible to become a VFX artist without a college degree, but the path is often easier with one. VFX artists usually major in fine arts, computer graphics, or animation. Choose a college with a reputation for strength in these areas and a reputation for good job placement for graduates. The availability of internships is another factor to consider.Out of the jobs advertised for VFX artists, a majority at any given time specify a bachelor’s degree as a minimum requirement for applicants.;\n, Some studios offer short-term programs for people who want to learn more about VFX artistry without pursuing a college degree. Enrolling in these programs can be expensive as financial aid isn’t always offered, but they usually have the most cutting edge technology for you to learn from., Although you may create some hand sketches, the majority of your work will be completed on the computer using the most up-to-date programs. Stay informed about the newest software advances by following V...",37,455
3,"\nStart with some experience or interest in art.,\nUnderstand the difference between art collectors, art investors and art speculators.,\nFigure out what you are willing to pay for art, before going to an auction house.,\nPay attention to what schools of art are selling well, and which are down.,\nFocus art investments on fine art paintings, rather than decorative art.,\nReach out to trusted auction houses and dealers when you are looking to buy art.,\nBuy your investment art when you feel confident of its worth, its price and its ability to grow in value.,\nStudy how art is properly stored.,\nHave your art investments appraised occasionally.,\nConsider renting out your art investments.,\nUnderstand that selling an art investment can take time.","The best art investors do their research on the pieces of art that they buy, so someone with some education or interest in the art world is more likely to understand this niche market. As well as personal research, you will need to have contacts with people in the art world, such as auctioneers, gallery directors and dealers, who can give you good investment advice.;\n, You may confuse these three terms, if you are not careful. Each of them has a slightly different goal in mind when looking to buy art.\n\n\nArt collectors do not buy art for investment purposes. They buy it to decorate and display in their home. Because they consider them to be an important part of their home or life, most art collectors have a hard time parting with pieces of their collection. While many collectors do end up selling some pieces of art, it may be done because of necessity. Collectors often loan their works out to museums and occasionally donate them to museums upon their death.\nArt investors seek ...",122,939
4,"\nKeep your reference materials, sketches, articles, photos, etc, in one easy to find place.,\nMake ""studies,"" or practice sketches, to organize effectively for larger projects.,\nLimit the supplies you leave out to the project at hand.,\nKeep an updated list of all of the necessary supplies, and the quantities of each.,\nBreak down bigger works into more easily completed parts.","As you start planning for a project or work, you'll likely be gathering scraps of inspiration and test sketches. While everyone has a strategy, there is nothing more maddening than digging through a book or the internet to re-find the cool idea you saw three months ago. Try out:\n\n\nDedicating 1 notebook, preferably with insert folders, to each project.\nMaking a bookmark folder for each project on your internet browser to easily compile online inspiration.\nTacking up physical inspiration on a wall or cork board near your workspace., Very few artists simply dive right into large projects. Almost 100% of the time they instead work on related, smaller projects called ""studies"" to prepare for the larger work. You might practice the face of the portrait you're making, sketch our different composition ideas, or practice a vulnerable or difficult part of a sculpture. Keep these organized as a way to prepare both the skills, ideas, and supplies needed for the final project.\n\n, At the...",60,438


In [14]:
def remove_outliers(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df


In [15]:
columns_to_remove_outliers = ['headline_length', 'text_length']

# Remove outliers based on word count columns
df = remove_outliers(df, columns_to_remove_outliers)
df_text_length = remove_outliers(df_text_length, columns_to_remove_outliers)

In [16]:
df.shape

(186446, 4)

In [17]:
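# Plotting histogram-boxplot
histogram_boxplot(df_text_length,'#89c2e0', '#d500ff', 600, 1000, True, 'Train Dataset')

[Figure: box plot and density curve of word counts for `headline_length` and `text_length`, after outlier removal]
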
After outlier removal, the median article length is 272 words and the median summary length is 37 words, so half of the examples are at least that long. We will use these statistics, together with the third quartiles below, as thresholds when fixing sequence lengths for tokenization.

* Q3(text) = 450
* Q3(summary) = 63
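
As a small worked example of how such quartiles and the 1.5 × IQR outlier fences are obtained, the sketch below uses only the standard library. The helper `iqr_bounds` and the word counts are hypothetical (they are not taken from the wikiHow data). `method="inclusive"` is chosen here on the assumption that it matches the linear interpolation pandas uses by default in `quantile`.

```python
import statistics

def iqr_bounds(values):
    """Return (Q1, median, Q3, lower_fence, upper_fence) for a list of numbers."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, median, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical word counts, used only to illustrate the arithmetic.
word_counts = [10, 20, 30, 40, 50, 60, 70, 80, 500]
q1, median, q3, lower, upper = iqr_bounds(word_counts)

# Keep only the values inside the fences, as remove_outliers does per column.
kept = [n for n in word_counts if lower <= n <= upper]

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"fences=({lower}, {upper}), kept={kept}")
```

Here Q1 = 30 and Q3 = 70, so IQR = 40 and the fences are −30 and 130. The 500-word entry falls outside them and is dropped.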

##### During preprocessing, we applied the following steps:
* removed outliers
* removed emojis
* removed URLs
* removed italic text
* converted all words to lowercase
* replaced newlines, tabs, and carriage returns with spaces
* removed punctuation (periods are kept)
* removed numerals
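
The order of these steps matters: stripping punctuation first destroys the `://` that a URL pattern needs in order to match. The sketch below shows this on a made-up sentence. The simplified URL pattern `https?://\S+` and both helper functions are assumptions for illustration, not the notebook's own regex.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")      # simplified, illustrative pattern
PUNCT_PATTERN = re.compile(r"[^\w\s.]+")       # everything except word chars, spaces, periods

def strip_punct_then_urls(text: str) -> str:
    text = PUNCT_PATTERN.sub("", text)   # "https://" collapses to "https"
    return URL_PATTERN.sub("", text)     # so the URL pattern no longer matches

def strip_urls_then_punct(text: str) -> str:
    text = URL_PATTERN.sub("", text)     # URL removed while it is still intact
    return PUNCT_PATTERN.sub("", text)

sample = "Read the guide at https://example.com/tips, then start!"
# " ".join(... .split()) collapses the double space left behind by a removal.
wrong = " ".join(strip_punct_then_urls(sample).split())
right = " ".join(strip_urls_then_punct(sample).split())

print(wrong)
print(right)
```

With punctuation stripped first, the URL survives as the fused token `httpsexample.comtips`. With URLs stripped first, it disappears entirely.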

In [65]:
import re
import emoji  # third-party package; replace_emoji requires emoji >= 2.0

def preprocess_text(text):
    text = str(text)

    # Remove URLs first, while "://" and "/" are still present to match on
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    # Remove text in italics (*...* or _..._) before punctuation is stripped
    italics_pattern = re.compile(r'\*(.*?)\*|\_(.*?)\_')
    text = italics_pattern.sub('', text)

    # Remove emojis (emoji.demojize would only rename them, e.g. ":thumbs_up:")
    text = emoji.replace_emoji(text, replace='')

    # Convert all words to lowercase
    text = text.lower()

    # Replace tabs, carriage returns, and newlines with spaces
    text = re.sub(r'[\t\r\n]', ' ', text)

    # Remove punctuation (periods are kept)
    text = re.sub(r'[^\w\s.]+', '', text)

    # Remove numerals
    text = re.sub(r'[0-9]+', '', text)

    return text

In [66]:
df['text'] = df['text'].apply(preprocess_text)
df['headline'] = df['headline'].apply(preprocess_text)

Taking advantage of spaCy's `.pipe()` method to speed up the cleaning process:

By processing text in batches, the code can take advantage of parallelization and reduce the memory footprint, making it more scalable for large datasets.
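
The batching idea itself can be illustrated without spaCy. The sketch below is a standard-library stand-in, not spaCy's implementation: `batched` and `clean_in_batches` are hypothetical helpers, and the lowercase-and-collapse-whitespace step merely stands in for the real cleaning.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def clean_in_batches(texts: Iterable[str], batch_size: int) -> Iterator[str]:
    """Apply a stand-in cleaning step batch by batch, yielding results lazily."""
    for batch in batched(texts, batch_size):
        # Only one batch is materialized at a time.
        for text in batch:
            yield " ".join(text.lower().split())

texts = ["Keep  Supplies together", "Clean\nyour desk", "Label Everything"]
batches = list(batched(texts, batch_size=2))
cleaned = list(clean_in_batches(texts, batch_size=2))

print([len(b) for b in batches])
print(cleaned)
```

Because both functions are generators, the input can be an arbitrarily long stream while memory use stays bounded by the batch size. Independent batches are also natural units to hand to separate worker processes.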

In [None]:


from time import time
import spacy

# Disable NER and the dependency parser for speed; neither is needed here
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# preprocess_text expects a single string, so apply it row by row
text_preprocess = df['text'].apply(preprocess_text)
summary_preprocess = df['headline'].apply(preprocess_text)


t = time()

# Process the texts in batches of 5000 (add n_process=-1 to nlp.pipe to use all CPU cores)
text = [str(doc) for doc in nlp.pipe(text_preprocess, batch_size=5000)]

print('Time to process the texts: {} mins'.format(round((time() - t) / 60, 2)))

t = time()

# Process the summaries in batches of 5000
summary = [str(doc) for doc in nlp.pipe(summary_preprocess, batch_size=5000)]

print('Time to process the summaries: {} mins'.format(round((time() - t) / 60, 2)))
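
Stop-word removal is listed among the preprocessing steps, and the batched pipeline is a natural place for it. The sketch below shows one way this could look. It uses a blank English pipeline (tokenizer only, no model download) and a hypothetical helper `remove_stop_words`; both are assumptions for illustration rather than the notebook's actual implementation.

```python
import spacy

# Blank English pipeline: tokenizer plus spaCy's built-in stop-word list
nlp_blank = spacy.blank("en")

def remove_stop_words(texts, batch_size=5000):
    """Stream texts through nlp.pipe in batches and drop tokens flagged as stop words."""
    cleaned = []
    for doc in nlp_blank.pipe(texts, batch_size=batch_size):
        cleaned.append(" ".join(tok.text for tok in doc if not tok.is_stop))
    return cleaned

samples = ["clean the workspace", "wipe down the dirty surfaces"]
print(remove_stop_words(samples, batch_size=2))
```

With the real data, `texts` would be the cleaned `df['text']` or `df['headline']` column, and `n_process` could be passed to `nlp_blank.pipe` to spread the work over several cores.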

In [67]:
df.head()

Unnamed: 0,headline,text,headline_length,text_length
0,keep related supplies in the same area make an effort to clean a dedicated workspace after every session place loose supplies in large clearly visible containers use clotheslines and clips to hang sketches photos and reference material use every inch of the room for storage especially vertical space use chalkboard paint to make space for drafting ideas right on the walls purchase a label maker to make your organization strategy semipermanent make a habit of throwing out old excess or useless stuff each month.,if youre a photographer keep all the necessary lens cords and batteries in the same quadrant of your home or studio paints should be kept with brushes cleaner and canvas print supplies should be by the ink etc make broader groups and areas for your supplies to make finding them easier limiting your search to a much smaller area some ideas include essential supplies area the things you use every day inspiration and reference area dedicated work area infrequent or secondary supplies area tucked out of the way this doesnt mean cleaning the entire studio it just means keeping the area immediately around the desk easel pottery wheel etc clean each night discard trash or unnecessary materials and wipe down dirty surfaces endeavor to leave the workspace in a way that you can sit down the next day and start working immediately without having to do any work or tidying even if the rest of your studio is a bit disorganized an organized workspace will help you get down to business every t...,84,608
1,create a sketch in the neopoprealist manner of the future mural on a small piece of paper x using the black ink pen prepare to create your neopoprealist mural prepare your paint begin your project with a design produce a scaled down version of your finished mural prepare the wall to be painted after you have primed the surface measure the wall paint in the base coat of the background allow the background and base coats to dry draw the lines then fill the appeared section with different repetitive patterns examine the images above paint patterns with brushes of suitable size for the particular portion of work you are painting clean up the lines and shapes as needed seal the mural if needed be inspired and it will help you succeed,see the image for how this drawing develops stepbystep however there is an important detail the following drawings are to examine it and then to create something unique use the lines to create the image shape and sections fill appeared sections with different patterns ornaments add text if needed for example neopoprealism is add a colored strip on the top any color you wish painting a mural always requires some preparation youll need equipment and effort but planning and attention to detail will help you succeed painting a mural requires a suitable location with the right surface that can be painted this surface should be smooth and flat however even roughtextured surfaces can be used for your neopoprealist mural project for exterior projects that last for years using a newer acrylic exterior paint would be your best choice for interior walls use latex paints latex offer easier cleanup and lower costs by measuring the total wall area to be covered the total amount of paint can b...,131,608
2,get a bachelors degree enroll in a studiobased program train on a number of vfx computer programs watch online tutorials nurture your artistic side pay close attention to movies television shows and video games develop a specialization.,it is possible to become a vfx artist without a college degree but the path is often easier with one vfx artists usually major in fine arts computer graphics or animation choose a college with a reputation for strength in these areas and a reputation for good job placement for graduates the availability of internships is another factor to consider.out of the jobs advertised for vfx artists a majority at any given time specify a bachelors degree as a minimum requirement for applicants some studios offer shortterm programs for people who want to learn more about vfx artistry without pursuing a college degree enrolling in these programs can be expensive as financial aid isnt always offered but they usually have the most cutting edge technology for you to learn from although you may create some hand sketches the majority of your work will be completed on the computer using the most uptodate programs stay informed about the newest software advances by following vfx blogs and taking onl...,37,455
3,start with some experience or interest in art understand the difference between art collectors art investors and art speculators figure out what you are willing to pay for art before going to an auction house pay attention to what schools of art are selling well and which are down focus art investments on fine art paintings rather than decorative art reach out to trusted auction houses and dealers when you are looking to buy art buy your investment art when you feel confident of its worth its price and its ability to grow in value study how art is properly stored have your art investments appraised occasionally consider renting out your art investments understand that selling an art investment can take time.,the best art investors do their research on the pieces of art that they buy so someone with some education or interest in the art world is more likely to understand this niche market as well as personal research you will need to have contacts with people in the art world such as auctioneers gallery directors and dealers who can give you good investment advice you may confuse these three terms if you are not careful each of them has a slightly different goal in mind when looking to buy art art collectors do not buy art for investment purposes they buy it to decorate and display in their home because they consider them to be an important part of their home or life most art collectors have a hard time parting with pieces of their collection while many collectors do end up selling some pieces of art it may be done because of necessity collectors often loan their works out to museums and occasionally donate them to museums upon their death art investors seek to diversify their portfoli...,122,939
4,keep your reference materials sketches articles photos etc in one easy to find place make studies or practice sketches to organize effectively for larger projects limit the supplies you leave out to the project at hand keep an updated list of all of the necessary supplies and the quantities of each break down bigger works into more easily completed parts.,as you start planning for a project or work youll likely be gathering scraps of inspiration and test sketches while everyone has a strategy there is nothing more maddening than digging through a book or the internet to refind the cool idea you saw three months ago try out dedicating notebook preferably with insert folders to each project making a bookmark folder for each project on your internet browser to easily compile online inspiration tacking up physical inspiration on a wall or cork board near your workspace very few artists simply dive right into large projects almost of the time they instead work on related smaller projects called studies to prepare for the larger work you might practice the face of the portrait youre making sketch our different composition ideas or practice a vulnerable or difficult part of a sculpture keep these organized as a way to prepare both the skills ideas and supplies needed for the final project at the end of the day artists are visual peopl...,60,438


In [70]:
csv_path_kaggle = '/kaggle/working/data.csv'

# Save the DataFrame to a CSV file
df.to_csv(csv_path_kaggle, index=False)