In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd


In [2]:
text="""The Comprehensive Guide to Data Science
Introduction
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It is a continuation of some data analysis fields such as statistics, data mining, and predictive analytics. With the advent of big data and the increasing importance of data-driven decisions, data science has become a critical area of expertise.

The Evolution of Data Science
Early Beginnings
The roots of data science can be traced back to statistics and mathematics. In the 1960s and 1970s, the term "data analysis" was commonly used. During this time, businesses and research institutions began using computers for data processing and analysis, laying the foundation for what we now call data science.

The Rise of Big Data
The late 1990s and early 2000s saw an explosion in the amount of data generated by digital technologies. The term "big data" emerged to describe datasets that were too large and complex to be processed by traditional data-processing software. This period also saw the development of new technologies like Hadoop and NoSQL databases, which enabled the storage and analysis of large datasets.

The Modern Era
Today, data science encompasses a wide range of techniques and tools from various fields such as machine learning, artificial intelligence, statistics, and computer science. The rise of cloud computing, advanced algorithms, and sophisticated data visualization tools has further propelled the field forward.

Key Components of Data Science
Data Collection
Data collection is the first step in the data science process. This involves gathering raw data from various sources such as databases, web scraping, APIs, sensors, and more. The quality and relevance of the data collected are crucial for the success of any data science project.

Data Cleaning
Raw data often contains errors, missing values, and inconsistencies. Data cleaning, also known as data preprocessing, involves identifying and correcting these issues. Techniques such as imputation, normalization, and transformation are used to prepare the data for analysis.

Exploratory Data Analysis (EDA)
EDA involves analyzing the data to discover patterns, trends, and anomalies. This step helps data scientists understand the underlying structure of the data and generate hypotheses for further analysis. Visualization tools like matplotlib, seaborn, and Tableau are commonly used during this phase.

Data Modeling
Data modeling involves building mathematical models to represent the relationships within the data. This step often employs machine learning algorithms to create predictive models. Techniques like regression, classification, clustering, and deep learning are used depending on the problem at hand.

Model Evaluation and Validation
Once a model is built, it needs to be evaluated to ensure its accuracy and reliability. Techniques such as cross-validation, confusion matrix, ROC curve, and various performance metrics (e.g., precision, recall, F1 score) are used to assess the model's performance.

Deployment and Monitoring
The final step in the data science process is deploying the model into a production environment. This involves integrating the model with existing systems and ensuring it performs well with real-world data. Continuous monitoring is necessary to track the model's performance and update it as needed.

Tools and Technologies in Data Science
Programming Languages
Python: Known for its simplicity and extensive libraries such as NumPy, pandas, scikit-learn, and TensorFlow, Python is the most popular language in data science.
R: Preferred for statistical analysis and visualization, R has a rich ecosystem of packages like ggplot2, dplyr, and caret.
SQL: Essential for managing and querying relational databases, SQL is a fundamental skill for data scientists.
Data Visualization Tools
Tableau: A powerful tool for creating interactive and shareable dashboards.
Power BI: Microsoft's business analytics tool for visualizing data and sharing insights.
Matplotlib and Seaborn: Python libraries for creating static, animated, and interactive visualizations.
Machine Learning Frameworks
Scikit-learn: A comprehensive library for machine learning in Python, offering simple and efficient tools for data mining and data analysis.
TensorFlow and Keras: Open-source libraries for numerical computation and machine learning, particularly well-suited for deep learning applications.
PyTorch: A deep learning framework that provides flexibility and speed, popular in both academia and industry.
Applications of Data Science
Healthcare
Data science has revolutionized healthcare by enabling predictive analytics, personalized medicine, and improved patient care. Applications include disease prediction, medical image analysis, and drug discovery.

Finance
In the finance industry, data science is used for fraud detection, risk management, algorithmic trading, and customer segmentation. Financial institutions leverage data-driven insights to make informed decisions and improve operational efficiency.

Marketing
Data science helps marketers understand customer behavior, optimize campaigns, and enhance customer engagement. Techniques like sentiment analysis, customer segmentation, and recommendation systems are commonly used.

E-commerce
E-commerce platforms use data science for inventory management, pricing strategies, customer personalization, and fraud detection. Analyzing customer data helps businesses tailor their offerings and improve user experience.

Transportation
Data science plays a crucial role in optimizing routes, predicting maintenance needs, and improving safety in the transportation industry. Companies like Uber and Lyft use data-driven algorithms for dynamic pricing and efficient fleet management.

Challenges in Data Science
Data Privacy and Security
With the increasing amount of data being collected, ensuring data privacy and security is a significant challenge. Data breaches and misuse of personal information can have serious consequences.

Data Quality
The quality of data is critical for the success of data science projects. Inaccurate, incomplete, or biased data can lead to misleading insights and poor decision-making.

Skill Gap
There is a high demand for skilled data scientists, but the supply of professionals with the necessary expertise is limited. Bridging this skill gap requires continuous learning and training.

Ethical Considerations
Data scientists must consider ethical issues such as bias in algorithms, transparency, and accountability. Ensuring fair and unbiased outcomes is essential for maintaining trust and integrity in data science.

The Future of Data Science
The future of data science is promising, with advancements in artificial intelligence, machine learning, and quantum computing set to revolutionize the field. As data continues to grow exponentially, the need for data science will only increase. Emerging technologies like the Internet of Things (IoT) and edge computing will generate even more data, creating new opportunities and challenges for data scientists."""

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenize=Tokenizer()

In [10]:
tokenize.fit_on_texts([text])

In [11]:
tokenize.word_index

{'data': 1,
 'and': 2,
 'the': 3,
 'science': 4,
 'of': 5,
 'for': 6,
 'to': 7,
 'is': 8,
 'in': 9,
 'a': 10,
 'analysis': 11,
 'as': 12,
 'learning': 13,
 'this': 14,
 'such': 15,
 'used': 16,
 'like': 17,
 'with': 18,
 'tools': 19,
 'machine': 20,
 'are': 21,
 'customer': 22,
 'algorithms': 23,
 'techniques': 24,
 'involves': 25,
 'scientists': 26,
 'insights': 27,
 'it': 28,
 'has': 29,
 'technologies': 30,
 'visualization': 31,
 'step': 32,
 'model': 33,
 'python': 34,
 'field': 35,
 'that': 36,
 'systems': 37,
 'from': 38,
 'statistics': 39,
 'predictive': 40,
 'analytics': 41,
 'big': 42,
 'driven': 43,
 'can': 44,
 'be': 45,
 'commonly': 46,
 'by': 47,
 'databases': 48,
 'various': 49,
 'computing': 50,
 'quality': 51,
 'helps': 52,
 'deep': 53,
 'performance': 54,
 'e': 55,
 'ensuring': 56,
 'libraries': 57,
 'skill': 58,
 'creating': 59,
 'applications': 60,
 'industry': 61,
 'management': 62,
 'comprehensive': 63,
 'an': 64,
 'fields': 65,
 'mining': 66,
 'increasing': 67,
 '

In [14]:
word_length=len(tokenize.word_index)+1

In [15]:
word_length

459
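
The `+1` in `word_length` is there because the Keras `Tokenizer` numbers words from 1 and leaves id 0 free for padding, so the embedding and softmax layers need one more slot than there are distinct words. Below is a minimal pure-Python sketch of that numbering scheme on a made-up sentence. The helper `build_word_index` is illustrative only and is not the Tokenizer's actual implementation.

```python
from collections import Counter

def build_word_index(words):
    """Rank words by descending frequency, starting at 1 (0 stays free for padding)."""
    counts = Counter(words)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_words = "data science uses data and data science".split()  # made-up input
toy_index = build_word_index(toy_words)

# One extra class so that id 0 (padding) is a valid index too.
toy_vocab_size = len(toy_index) + 1

print(toy_index)
print(toy_vocab_size)
```

With four distinct words, the largest id is 4, yet a layer indexed by id needs 5 rows to cover ids 0 through 4.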

In [16]:
sent=text.split("\n")

In [65]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [66]:
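token_sequence=[]
for line in sent:
    # Convert the line into a list of integer word ids
    token_list=tokenize.texts_to_sequences([line])[0]
    # Emit every prefix of the line; the last id of each prefix is the word to predict
    for i in range(len(token_list)):
        n_gram=token_list[:i+1]
        print(n_gram)
        token_sequence.append(n_gram)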


[3]
[3, 63]
[3, 63, 142]
[3, 63, 142, 7]
[3, 63, 142, 7, 1]
[3, 63, 142, 7, 1, 4]
[143]
[1]
[1, 4]
[1, 4, 8]
[1, 4, 8, 64]
[1, 4, 8, 64, 144]
[1, 4, 8, 64, 144, 35]
[1, 4, 8, 64, 144, 35, 36]
[1, 4, 8, 64, 144, 35, 36, 145]
[1, 4, 8, 64, 144, 35, 36, 145, 146]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7, 149]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7, 149, 150]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7, 149, 150, 2]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7, 149, 150, 2, 27]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 37, 7, 149, 150, 2, 27, 38]
[1, 4, 8, 64, 144, 35, 36, 145, 146, 147, 148, 23, 2, 3
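
The loop in this cell turns each line into every prefix of its token list, so a line with n tokens contributes n training sequences. Here is a small sketch of the same idea on a made-up token list. The helper name `prefix_sequences` and its `min_len` parameter are assumptions for illustration and do not appear in the notebook.

```python
def prefix_sequences(token_list, min_len=1):
    """Return every prefix of token_list that has at least min_len tokens."""
    return [token_list[:end] for end in range(min_len, len(token_list) + 1)]

toy_line = [3, 63, 142, 7]  # hypothetical token ids for one line

# Same behaviour as the notebook loop: prefixes of length 1..n.
all_prefixes = prefix_sequences(toy_line)

# min_len=2 skips one-token prefixes, which would otherwise become
# training rows whose input part is entirely padding.
useful_prefixes = prefix_sequences(toy_line, min_len=2)

print(all_prefixes)
print(useful_prefixes)
```

Whether to keep the one-token prefixes is a design choice. They teach the model which words tend to start a line, but give it no context to learn from.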

In [26]:
length=[len(seq) for seq in token_sequence]


In [31]:
max_length=max(length)

In [32]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
paded_token_sequence=pad_sequences(token_sequence,maxlen=max_length,padding='pre')

In [34]:
paded_token_sequence

array([[  0,   0,   0, ...,   0,   0,   3],
       [  0,   0,   0, ...,   0,   3,  63],
       [  0,   0,   0, ...,   3,  63, 142],
       ...,
       [  0,   0,   0, ...,   2, 135,   6],
       [  0,   0,   0, ..., 135,   6,   1],
       [  0,   0,   0, ...,   6,   1,  26]])

In [36]:
x=paded_token_sequence[:,:-1]
y=paded_token_sequence[:,-1]
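
Pre-padding pushes every sequence to the right edge, so the last column always holds the word to predict and the remaining columns hold its context. The following pure-Python sketch mimics that step on three toy sequences. `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, written only to make the slicing visible.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` to length maxlen.

    Sequences longer than maxlen keep their last maxlen items.
    """
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[3], [3, 63], [3, 63, 142]]  # made-up prefixes
toy_max_length = max(len(seq) for seq in toy_sequences)
toy_padded = pad_pre(toy_sequences, toy_max_length)

# Everything except the last column is the input; the last column is the target.
toy_x = [row[:-1] for row in toy_padded]
toy_y = [row[-1] for row in toy_padded]

print(toy_padded)
print(toy_x)
print(toy_y)
```

Note that the first toy row has an all-zero input, which is what happens to every one-token prefix in the real data.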

In [43]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense,Embedding

In [41]:
y = to_categorical(y,num_classes=word_length)

In [42]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
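
Each row of `y` is a one-hot vector: all zeros except a 1 at the position of the target word id. The sketch below reproduces this on a few made-up labels in plain Python. The `one_hot` helper is illustrative and is not how `to_categorical` is implemented internally.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at index `label`."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

toy_labels = [3, 1, 2]   # made-up target word ids
toy_num_classes = 5      # plays the role of word_length
toy_one_hot = one_hot(toy_labels, toy_num_classes)

for row in toy_one_hot:
    print(row)
```

One-hot targets cost one column per vocabulary word. A possible alternative, not used in this notebook, is to keep `y` as integer ids and compile the model with `loss='sparse_categorical_crossentropy'`.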

In [78]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=word_length, output_dim=100),
    tf.keras.layers.LSTM(150),
    tf.keras.layers.Dense(word_length, activation='softmax')
])

In [79]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [80]:
model.build(input_shape=(None, max_length-1))

In [81]:
model.summary()

In [82]:
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=3,
    restore_best_weights=True
)

history = model.fit(
    x, y,
    validation_split=0.2,  # 20% of data for validation
    epochs=200,
    batch_size=32,
    callbacks=[early_stopping],  # without this, the EarlyStopping object above is never used
)

Epoch 1/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 4s 61ms/step - accuracy: 0.0468 - loss: 6.0263 - val_accuracy: 0.0631 - val_loss: 5.6493
Epoch 2/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 0.0595 - loss: 5.3766 - val_accuracy: 0.0777 - val_loss: 5.9379
Epoch 3/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 1s 39ms/step - accuracy: 0.0675 - loss: 5.3447 - val_accuracy: 0.1019 - val_loss: 6.1361
Epoch 4/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 0.0859 - loss: 5.2603 - val_accuracy: 0.0971 - val_loss: 6.1994
Epoch 5/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 0.1077 - loss: 5.1507 - val_accuracy: 0.1117 - val_loss: 6.2257
Epoch 6/200
26/26 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 0.0855 - loss: 5.1143 - val_accuracy: 0.1262 - val_loss: 6.3001
Epoch 7/200
26/26
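
Once training finishes, a model like this is typically used by padding a seed phrase the same way as the training inputs and taking the most probable class as the next word. The sketch below shows that greedy lookup in plain Python with a stub in place of the network. `predict_next_word`, `predict_fn`, and the stub are hypothetical names, and the lowercase-and-split step is only a rough stand-in for the Keras tokenizer.

```python
def predict_next_word(predict_fn, word_index, seed_text, input_length):
    """Greedy next-word lookup.

    predict_fn is assumed to take one padded list of ids and return
    a list of probabilities, one per class id (index 0 = padding).
    """
    # Rough tokenization: lowercase, split on whitespace, drop unknown words.
    token_ids = [word_index[w] for w in seed_text.lower().split() if w in word_index]
    token_ids = token_ids[-input_length:]
    padded = [0] * (input_length - len(token_ids)) + token_ids

    probabilities = predict_fn(padded)
    best_id = max(range(len(probabilities)), key=lambda i: probabilities[i])

    index_word = {idx: word for word, idx in word_index.items()}
    return index_word.get(best_id)  # None if the padding id 0 wins

# Stub standing in for the trained network: always favours class id 2.
def stub_predict(padded_ids):
    return [0.1, 0.2, 0.6, 0.1]

stub_index = {"data": 1, "science": 2, "is": 3}  # made-up vocabulary
next_word = predict_next_word(stub_predict, stub_index, "Data", input_length=3)
print(next_word)
```

With the notebook's objects, `predict_fn` could plausibly be `lambda ids: model.predict(np.array([ids]), verbose=0)[0]`, with `word_index=tokenize.word_index` and `input_length=max_length-1`. That wiring is an assumption and has not been run here.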