In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [4]:
txt = '''

Data Science is an interdisciplinary field that blends scientific methods, processes, algorithms, and systems to extract meaningful insights from structured and unstructured data. It encompasses a range of techniques from statistics, computer science, and domain-specific knowledge to interpret complex datasets. The process typically begins with data collection, where raw data is gathered from various sources. This is followed by data cleaning to preprocess and eliminate inconsistencies. Data analysis employs statistical and machine learning methods to uncover trends and patterns, while data visualization translates these insights into comprehensible visual formats, aiding in understanding and communication.

Predictive modeling is another critical aspect, where models are built to forecast future outcomes based on historical data. Interpreting these data-driven insights allows for informed decision-making across various sectors. Tools and technologies integral to data science include programming languages like Python and R, along with libraries such as Pandas, NumPy, and Scikit-learn. Data visualization tools like Matplotlib, Seaborn, Tableau, and Power BI help in presenting data insights effectively. Big Data technologies like Hadoop and Spark are employed for handling large volumes of data, while databases such as SQL and NoSQL manage data storage and retrieval.

The applications of data science are vast, impacting numerous industries. In business, it supports customer segmentation, sales forecasting, and market analysis. Healthcare benefits from predictive analytics for patient outcomes, drug discovery, and personalized medicine. In finance, data science aids in risk management, fraud detection, and algorithmic trading, while retail sectors use it for inventory management, recommendation systems, and analyzing customer sentiment.

Key skills for data scientists include statistical analysis, programming proficiency, machine learning expertise, data wrangling capabilities, and domain-specific knowledge. Current trends in the field highlight the increasing role of AI and machine learning, the rise of Automated Machine Learning (AutoML) to streamline model building, the challenges and opportunities presented by big data, and the growing focus on ethical considerations, including bias, privacy, and transparency.
'''

In [6]:
len(nlp.Defaults.stop_words)

326

In [7]:
nlp.vocab['is'].is_stop

True

In [8]:
nlp.vocab['Data'].is_stop

False
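A minimal sketch of how the `is_stop` lookups above relate to the raw stop word set. It assumes a blank English pipeline (`spacy.blank("en")`) so no model download is needed; the sample sentence is made up for illustration. In current spaCy versions the default `is_stop` check lowercases the string first, which is why capitalized words can still be flagged even though the set stores lowercase entries.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because is_stop is a
# lexical attribute and does not need a trained model.
nlp = spacy.blank("en")

doc = nlp("The data is clean")  # hypothetical sample sentence

# Token-level flag: case-insensitive for the default stop word list.
flags = [(token.text, token.is_stop) for token in doc]

# Raw set membership: the default entries are stored in lowercase.
lower_in_set = "the" in nlp.Defaults.stop_words
title_in_set = "The" in nlp.Defaults.stop_words

print(flags)
print(lower_in_set, title_in_set)
```

When checking membership directly against `nlp.Defaults.stop_words`, lowercase the word first to match what `is_stop` does.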

## Adding Custom Words to the List of Stopwords

In [9]:
len(nlp.Defaults.stop_words)

326

In [10]:
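# Register the custom stop word in the language defaults...
nlp.Defaults.stop_words.add('i.e')

In [11]:
# ...and flag the vocab entry too, so tokens report is_stop == True.
nlp.vocab['i.e'].is_stop = True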

In [12]:
len(nlp.Defaults.stop_words)

327

## Removing Custom Words from the List of Stopwords

In [13]:
nlp.vocab['i.e'].is_stop

True

In [15]:
nlp.Defaults.stop_words.remove('i.e')
nlp.vocab['i.e'].is_stop = False

In [16]:
nlp.vocab['i.e'].is_stop

False
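The add and remove steps above both touch two places: the defaults set and the vocab flag. One way to keep them in sync is a small helper, sketched below. The name `set_stop_word` is hypothetical (it is not part of spaCy), and a blank English pipeline is assumed so the example runs without a downloaded model.

```python
import spacy


def set_stop_word(nlp, word, is_stop=True):
    """Hypothetical helper: update the defaults set and the vocab flag together."""
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard() does not raise if the word was never added
        nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = is_stop


nlp = spacy.blank("en")  # assumption: blank pipeline instead of en_core_web_sm

before = len(nlp.Defaults.stop_words)

set_stop_word(nlp, "i.e", True)
flag_after_add = nlp.vocab["i.e"].is_stop
size_after_add = len(nlp.Defaults.stop_words)

set_stop_word(nlp, "i.e", False)
flag_after_remove = nlp.vocab["i.e"].is_stop
size_after_remove = len(nlp.Defaults.stop_words)

print(before, size_after_add, size_after_remove)
print(flag_after_add, flag_after_remove)
```

Using `discard` rather than `remove` means the helper can be called repeatedly without raising a `KeyError`.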

## Removing Stopwords from the Corpus

In [17]:
txt

'\n\nData Science is an interdisciplinary field that blends scientific methods, processes, algorithms, and systems to extract meaningful insights from structured and unstructured data. It encompasses a range of techniques from statistics, computer science, and domain-specific knowledge to interpret complex datasets. The process typically begins with data collection, where raw data is gathered from various sources. This is followed by data cleaning to preprocess and eliminate inconsistencies. Data analysis employs statistical and machine learning methods to uncover trends and patterns, while data visualization translates these insights into comprehensible visual formats, aiding in understanding and communication.\n\nPredictive modeling is another critical aspect, where models are built to forecast future outcomes based on historical data. Interpreting these data-driven insights allows for informed decision-making across various sectors. Tools and technologies integral to data science in

In [25]:
text = "Data Science is an interdisciplinary field that combines scientific methods, processes, algorithms, and systems to derive meaningful insights from structured and unstructured data. It integrates techniques from statistics, computer science, and domain-specific knowledge to analyze complex datasets. The process typically starts with data collection, where raw data is gathered from various sources. This is followed by data cleaning to address inconsistencies and errors. Data analysis uses statistical and machine learning methods to uncover trends and patterns, and data visualization converts these insights into understandable visual formats, aiding in interpretation and communication."

In [19]:
# Collapse newlines and repeated spaces into single spaces.
# (Replacing every space with '' would glue all the words together.)
txt = ' '.join(txt.split())

txt

'Data Science is an interdisciplinary field that blends scientific methods, processes, algorithms, and systems to extract meaningful insights from structured and unstructured data. It encompasses a range of techniques from statistics, computer science, and domain-specific knowledge to interpret complex datasets. The process typically begins with data collection, where raw data is gathered from various sources. This is followed by data cleaning to preprocess and eliminate inconsistencies. Data analysis employs statistical and machine learning methods to uncover trends and patterns, while data visualization translates these insights into comprehensible visual formats, aiding in understanding and communication. Predictive modeling is another critical aspect, where models are built to forecast future outcomes based on historical data. Interpreting these data-driven insights allows for informed decision-making across various sectors. Tools and technologies integral to data science include programming languages like Python and R, along with libraries such as Pandas, NumPy, and Scikit-learn. Data visualization tools like Matplotlib, Seaborn
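The cleaning step above can be sketched as a small standalone helper that collapses whitespace without gluing words together. The function name `normalize_whitespace` and the sample string are illustrative assumptions. The second result shows what happens when every space is stripped instead, which leaves the tokenizer with no word boundaries to work with.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Collapse any run of spaces, tabs, or newlines into a single space."""
    return re.sub(r"\s+", " ", raw).strip()


# Hypothetical sample with leading newlines, a double space, and a blank line
sample = "\n\nData Science is  a field.\n\nIt uses data.\n"

cleaned = normalize_whitespace(sample)

# For contrast: removing newlines AND every space destroys word boundaries
glued = sample.replace("\n", "").replace(" ", "")

print(cleaned)
print(glued)
```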

In [26]:
corp = nlp(text)  # parses the single-paragraph `text` from In [25], not the longer `txt`

In [27]:
corp

Data Science is an interdisciplinary field that combines scientific methods, processes, algorithms, and systems to derive meaningful insights from structured and unstructured data. It integrates techniques from statistics, computer science, and domain-specific knowledge to analyze complex datasets. The process typically starts with data collection, where raw data is gathered from various sources. This is followed by data cleaning to address inconsistencies and errors. Data analysis uses statistical and machine learning methods to uncover trends and patterns, and data visualization converts these insights into understandable visual formats, aiding in interpretation and communication.

## Finding Stopwords in the Corpus

In [32]:
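# Collect the distinct stop words that actually occur in the parsed text.
# token.text keeps the original casing, so 'The' and 'the' would be stored separately.
stop_words = set()

for token in corp:
    if token.is_stop:
        stop_words.add(token.text)

print(stop_words)
print(len(stop_words))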

{'is', 'The', 'an', 'and', 'to', 'It', 'various', 'by', 'from', 'where', 'that', 'in', 'into', 'with', 'This', 'these'}
16
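Because the set above stores `token.text`, capitalized and lowercase forms of the same word would be counted separately, and the set does not record how often each stop word occurs. A possible variation, sketched under the assumption of a blank English pipeline and a made-up sample text, counts stop words by their lowercased form with `collections.Counter`.

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline, no trained model needed

# Hypothetical sample text; "The" appears twice with a capital letter
doc = nlp("The process starts with data. The data is raw.")

# token.lower_ is the lowercased token text, so "The" and "the" share a key
stop_counts = Counter(token.lower_ for token in doc if token.is_stop)

for word, count in stop_counts.most_common():
    print(word, count)
```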


In [31]:
len(stop_words)

16

In [34]:
' '.join([token.text for token in corp if not token.is_stop])

'Data Science interdisciplinary field combines scientific methods , processes , algorithms , systems derive meaningful insights structured unstructured data . integrates techniques statistics , computer science , domain - specific knowledge analyze complex datasets . process typically starts data collection , raw data gathered sources . followed data cleaning address inconsistencies errors . Data analysis uses statistical machine learning methods uncover trends patterns , data visualization converts insights understandable visual formats , aiding interpretation communication .'
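The joined string above still contains standalone punctuation tokens such as ` , ` and ` . `, because only stop words were filtered. A minimal sketch of a stricter filter follows; the helper name `content_tokens` is hypothetical, and a blank English pipeline is assumed. It also drops punctuation and whitespace tokens and lowercases what remains, which is a common (but not the only) preprocessing choice.

```python
import spacy


def content_tokens(doc):
    """Hypothetical helper: lowercased tokens that are not stop words, punctuation, or whitespace."""
    return [
        token.lower_
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


nlp = spacy.blank("en")  # assumption: the tokenizer alone is sufficient here

doc = nlp("This is followed by data cleaning.")
tokens = content_tokens(doc)
joined = " ".join(tokens)

print(tokens)
print(joined)
```

Whether to drop punctuation depends on the downstream task: bag-of-words features rarely need it, while sentence-level models often do.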